ISLR is recommended. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.
The general machine learning process is: 1. data acquisition 2. data cleaning 3. model training & building 4. model testing 5. model deployment
Supervised learning models are trained using labeled examples, i.e., inputs where the desired output is known. The learning algorithm receives a set of inputs along with the corresponding correct outputs, and it learns by comparing its actual output with the correct outputs to find errors. It then modifies the model accordingly. - classification - regression - prediction - gradient boosting
Supervised learning is commonly used in applications where historical data predicts likely future events.
Unsupervised learning is used against data that has no historical labels. The system is not told the “right answer”. The algorithm must figure out what is being shown. The goal is to explore the data and find some structure within.
Unsupervised learning can find the main attributes that separate customer segments from each other. Popular techniques include self-organizing maps, nearest-neighbor mapping, k-means clustering and singular value decomposition.
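As a minimal sketch of one of these techniques, k-means clustering can be run with R's built-in kmeans() function (simulated data; the variable names here are illustrative only):

```r
# two well-separated groups of 2-D points, no labels given
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# ask k-means to recover 2 clusters from the unlabeled data
km <- kmeans(pts, centers = 2)
table(km$cluster)  # sizes of the two recovered clusters
```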
Reinforcement learning is often used for robotics, gaming and navigation. With reinforcement learning, the algorithm discovers through trial and error which actions yield the greatest rewards.
The linear model formula takes the form (y ~ x). To add more predictor variables, just use the + sign, e.g., (y ~ x + z).
Example - model structure: model <- lm(log(PINCP,base=10) ~ AGEP + SEX + COW + SCHL, data=dtrain) - predict: dtest$predLogPINCP <- predict(model, newdata=dtest); dtrain$predLogPINCP <- predict(model, newdata=dtrain)
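The formula interface can be sketched on simulated data (toy names x, z, y; these are not from the census example above):

```r
# simulate a toy dataset with two predictors
set.seed(1)
toy <- data.frame(x = 1:50, z = rnorm(50))
toy$y <- 2 * toy$x + 3 * toy$z + rnorm(50)

# y ~ x + z: additional predictors are joined with +
fit <- lm(y ~ x + z, data = toy)
coef(fit)  # fitted slopes should be close to 2 and 3

# predict() works the same way as in the example above
toy$pred <- predict(fit, newdata = toy)
```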
Get the data from: http://archive.ics.uci.edu/ml/datasets/Student+Performance
# student math data
df <- read.csv("student-mat.csv",sep=";")
class(df)
## [1] "data.frame"
names(df)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
summary(df)
## school sex age address famsize Pstatus Medu
## GP:349 F:208 Min. :15.0 R: 88 GT3:281 A: 41 Min. :0.000
## MS: 46 M:187 1st Qu.:16.0 U:307 LE3:114 T:354 1st Qu.:2.000
## Median :17.0 Median :3.000
## Mean :16.7 Mean :2.749
## 3rd Qu.:18.0 3rd Qu.:4.000
## Max. :22.0 Max. :4.000
## Fedu Mjob Fjob reason
## Min. :0.000 at_home : 59 at_home : 20 course :145
## 1st Qu.:2.000 health : 34 health : 18 home :109
## Median :2.000 other :141 other :217 other : 36
## Mean :2.522 services:103 services:111 reputation:105
## 3rd Qu.:3.000 teacher : 58 teacher : 29
## Max. :4.000
## guardian traveltime studytime failures schoolsup
## father: 90 Min. :1.000 Min. :1.000 Min. :0.0000 no :344
## mother:273 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 51
## other : 32 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery higher internet romantic
## no :153 no :214 no :194 no : 81 no : 20 no : 66 no :263
## yes:242 yes:181 yes:201 yes:314 yes:375 yes:329 yes:132
##
##
##
##
## famrel freetime goout Dalc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :4.000 Median :3.000 Median :3.000 Median :1.000
## Mean :3.944 Mean :3.235 Mean :3.109 Mean :1.481
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Walc health absences G1
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00
## Median :2.000 Median :4.000 Median : 4.000 Median :11.00
## Mean :2.291 Mean :3.554 Mean : 5.709 Mean :10.91
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00
## Max. :5.000 Max. :5.000 Max. :75.000 Max. :19.00
## G2 G3
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :10.71 Mean :10.42
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :20.00
# check that no NA values are stored in the data frame
any(is.na(df))
## [1] FALSE
str(df)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
We plot data using ggplot2
library(ggplot2)
library(ggthemes)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# data correlation detection
# num only (numerical columns only)
num.cols <- sapply(df,is.numeric)
# filter
cor.data <- cor(df[,num.cols]) %>% print()
## age Medu Fedu traveltime
## age 1.000000000 -0.163658419 -0.163438069 0.070640721
## Medu -0.163658419 1.000000000 0.623455112 -0.171639305
## Fedu -0.163438069 0.623455112 1.000000000 -0.158194054
## traveltime 0.070640721 -0.171639305 -0.158194054 1.000000000
## studytime -0.004140037 0.064944137 -0.009174639 -0.100909119
## failures 0.243665377 -0.236679963 -0.250408444 0.092238746
## famrel 0.053940096 -0.003914458 -0.001369727 -0.016807986
## freetime 0.016434389 0.030890867 -0.012845528 -0.017024944
## goout 0.126963880 0.064094438 0.043104668 0.028539674
## Dalc 0.131124605 0.019834099 0.002386429 0.138325309
## Walc 0.117276052 -0.047123460 -0.012631018 0.134115752
## health -0.062187369 -0.046877829 0.014741537 0.007500606
## absences 0.175230079 0.100284818 0.024472887 -0.012943775
## G1 -0.064081497 0.205340997 0.190269936 -0.093039992
## G2 -0.143474049 0.215527168 0.164893393 -0.153197963
## G3 -0.161579438 0.217147496 0.152456939 -0.117142053
## studytime failures famrel freetime goout
## age -0.004140037 0.24366538 0.053940096 0.01643439 0.126963880
## Medu 0.064944137 -0.23667996 -0.003914458 0.03089087 0.064094438
## Fedu -0.009174639 -0.25040844 -0.001369727 -0.01284553 0.043104668
## traveltime -0.100909119 0.09223875 -0.016807986 -0.01702494 0.028539674
## studytime 1.000000000 -0.17356303 0.039730704 -0.14319841 -0.063903675
## failures -0.173563031 1.00000000 -0.044336626 0.09198747 0.124560922
## famrel 0.039730704 -0.04433663 1.000000000 0.15070144 0.064568411
## freetime -0.143198407 0.09198747 0.150701444 1.00000000 0.285018715
## goout -0.063903675 0.12456092 0.064568411 0.28501871 1.000000000
## Dalc -0.196019263 0.13604693 -0.077594357 0.20900085 0.266993848
## Walc -0.253784731 0.14196203 -0.113397308 0.14782181 0.420385745
## health -0.075615863 0.06582728 0.094055728 0.07573336 -0.009577254
## absences -0.062700175 0.06372583 -0.044354095 -0.05807792 0.044302220
## G1 0.160611915 -0.35471761 0.022168316 0.01261293 -0.149103967
## G2 0.135879999 -0.35589563 -0.018281347 -0.01377714 -0.162250034
## G3 0.097819690 -0.36041494 0.051363429 0.01130724 -0.132791474
## Dalc Walc health absences G1
## age 0.131124605 0.11727605 -0.062187369 0.17523008 -0.06408150
## Medu 0.019834099 -0.04712346 -0.046877829 0.10028482 0.20534100
## Fedu 0.002386429 -0.01263102 0.014741537 0.02447289 0.19026994
## traveltime 0.138325309 0.13411575 0.007500606 -0.01294378 -0.09303999
## studytime -0.196019263 -0.25378473 -0.075615863 -0.06270018 0.16061192
## failures 0.136046931 0.14196203 0.065827282 0.06372583 -0.35471761
## famrel -0.077594357 -0.11339731 0.094055728 -0.04435409 0.02216832
## freetime 0.209000848 0.14782181 0.075733357 -0.05807792 0.01261293
## goout 0.266993848 0.42038575 -0.009577254 0.04430222 -0.14910397
## Dalc 1.000000000 0.64754423 0.077179582 0.11190803 -0.09415879
## Walc 0.647544230 1.00000000 0.092476317 0.13629110 -0.12617921
## health 0.077179582 0.09247632 1.000000000 -0.02993671 -0.07317207
## absences 0.111908026 0.13629110 -0.029936711 1.00000000 -0.03100290
## G1 -0.094158792 -0.12617921 -0.073172073 -0.03100290 1.00000000
## G2 -0.064120183 -0.08492735 -0.097719866 -0.03177670 0.85211807
## G3 -0.054660041 -0.05193932 -0.061334605 0.03424732 0.80146793
## G2 G3
## age -0.14347405 -0.16157944
## Medu 0.21552717 0.21714750
## Fedu 0.16489339 0.15245694
## traveltime -0.15319796 -0.11714205
## studytime 0.13588000 0.09781969
## failures -0.35589563 -0.36041494
## famrel -0.01828135 0.05136343
## freetime -0.01377714 0.01130724
## goout -0.16225003 -0.13279147
## Dalc -0.06412018 -0.05466004
## Walc -0.08492735 -0.05193932
## health -0.09771987 -0.06133460
## absences -0.03177670 0.03424732
## G1 0.85211807 0.80146793
## G2 1.00000000 0.90486799
## G3 0.90486799 1.00000000
# visualize the correlations
# package: corrgram,
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.4
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
# corrplot - quickly visualize correlation
corrplot(cor.data, method='color')
# corrgram - pass into the data frame directly
corrgram(df)
# help("corrgram")
# ggplot visualize
ggplot(df, aes(x=G3))+geom_histogram(bins = 20, alpha=0.5, fill='blue')
Split the data into train and test sets - we use install.packages('caTools')
# caTools for data split
Math <- read.csv("student-mat.csv",sep=";")
head(Math)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
# install.packages('caTools')
library(caTools)
## Warning: package 'caTools' was built under R version 3.4.4
# set a seed
set.seed(101)
Approach (1): caTools package
sample <- sample.split(Math$G3, SplitRatio=0.7)
train <- subset(Math, sample==TRUE)
test <- subset(Math, sample==FALSE)
Approach (2): simple version using sample
# alternative - split of training/test set based on ISLR
library(magrittr)
train_set <- sample(395,197)
train_set
## [1] 279 117 59 192 45 170 322 33 337 257 370 83 317 339 92 310 219
## [18] 174 20 232 373 217 127 163 80 347 359 6 148 266 168 189 301 327
## [35] 272 274 302 228 159 167 262 240 142 383 365 314 338 211 198 157 335
## [52] 271 35 104 79 394 97 363 186 180 115 242 237 101 133 258 144 259
## [69] 112 371 181 286 203 233 52 103 114 207 387 94 171 269 173 131 70
## [86] 226 16 90 12 368 214 183 307 145 201 77 36 22 197 151 350 209
## [103] 71 62 130 118 175 105 84 358 200 342 346 34 43 281 188 312 341
## [120] 220 216 42 47 121 178 106 323 23 1 251 378 75 111 328 372 205
## [137] 193 150 126 386 21 158 116 282 18 195 134 122 270 46 210 160 213
## [154] 31 67 138 275 140 32 343 351 110 238 14 318 28 352 4 227 329
## [171] 321 184 225 392 291 309 154 85 176 265 113 298 264 326 212 76 382
## [188] 222 10 384 285 331 247 353 244 147 88
print(head(Math[train_set, ]))
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
print(head(Math[-train_set, ]))
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
# linear model fit - all variables
# equivalent alternative using row indices: lm(G3 ~ ., data=Math, subset=train_set)
lm.fit <- lm(G3~., data=train)
print(summary(lm.fit))
##
## Call:
## lm(formula = G3 ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4250 -0.6478 0.2844 1.0442 4.9840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.70763 2.69488 1.376 0.17019
## schoolMS 0.66981 0.47436 1.412 0.15926
## sexM 0.25730 0.29257 0.879 0.38006
## age -0.36163 0.12949 -2.793 0.00566 **
## addressU 0.08123 0.35652 0.228 0.81996
## famsizeLE3 0.12222 0.28709 0.426 0.67070
## PstatusT 0.06807 0.43032 0.158 0.87444
## Medu 0.11100 0.18757 0.592 0.55455
## Fedu -0.16373 0.15928 -1.028 0.30503
## Mjobhealth -0.63993 0.65314 -0.980 0.32820
## Mjobother -0.15730 0.42323 -0.372 0.71048
## Mjobservices -0.15872 0.46682 -0.340 0.73415
## Mjobteacher -0.04930 0.62335 -0.079 0.93702
## Fjobhealth 0.17565 0.83034 0.212 0.83265
## Fjobother -0.29559 0.56012 -0.528 0.59818
## Fjobservices -0.76964 0.59476 -1.294 0.19692
## Fjobteacher -0.27009 0.73824 -0.366 0.71480
## reasonhome -0.41126 0.31857 -1.291 0.19799
## reasonother 0.06767 0.45323 0.149 0.88144
## reasonreputation 0.13478 0.34735 0.388 0.69834
## guardianmother -0.05442 0.31663 -0.172 0.86369
## guardianother 0.01588 0.58375 0.027 0.97832
## traveltime -0.02353 0.19540 -0.120 0.90427
## studytime -0.04294 0.16910 -0.254 0.79979
## failures -0.17219 0.19668 -0.875 0.38220
## schoolsupyes 0.20742 0.42358 0.490 0.62481
## famsupyes -0.05329 0.27753 -0.192 0.84789
## paidyes 0.31311 0.28284 1.107 0.26941
## activitiesyes -0.26104 0.26687 -0.978 0.32901
## nurseryyes -0.05345 0.31236 -0.171 0.86428
## higheryes -0.94298 0.74005 -1.274 0.20385
## internetyes -0.15834 0.37029 -0.428 0.66932
## romanticyes -0.30048 0.28115 -1.069 0.28627
## famrel 0.36601 0.14609 2.505 0.01291 *
## freetime 0.08386 0.14247 0.589 0.55668
## goout -0.12457 0.13306 -0.936 0.35015
## Dalc -0.16995 0.20659 -0.823 0.41153
## Walc 0.21053 0.14963 1.407 0.16074
## health 0.07805 0.09341 0.836 0.40426
## absences 0.09547 0.02382 4.008 8.24e-05 ***
## G1 0.14259 0.07892 1.807 0.07206 .
## G2 0.98859 0.06929 14.267 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.962 on 235 degrees of freedom
## Multiple R-squared: 0.8456, Adjusted R-squared: 0.8187
## F-statistic: 31.39 on 41 and 235 DF, p-value: < 2.2e-16
# coefficient plot
library(coefplot)
coefplot(lm.fit)
# residual value extract
res <- residuals(lm.fit)
class(res)
## [1] "numeric"
res <- as.data.frame(res)
head(res)
## res
## 1 1.4684389
## 2 1.8826707
## 3 1.1866990
## 6 -2.2440152
## 9 0.5974865
## 11 0.8583062
# plot residual
plot(lm.fit$residuals)
ggplot(res,aes(x=res))+geom_histogram(fill="light blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Advanced visualization can be applied for model interpretation: 1. Residuals vs Fitted 2. Normal Q-Q 3. Scale-Location 4. Residuals vs Leverage
plot(lm.fit)
The model is used for prediction using the following code.
# Prediction
pred <- predict(lm.fit,test)
# column bind
results <- cbind(pred,test$G3)
results <- as.data.frame(results)
colnames(results) <- c('predicted','actual')
head(results)
## predicted actual
## 1 4.531561 6
## 2 4.117329 6
## 3 8.813301 10
## 4 12.682507 15
## 5 9.433677 10
## 6 17.244015 15
plot(results$predicted,results$actual)
# take care of negative values
to_zero <- function(x){
if (x<0){
return(0)
}else{
return(x)
}
}
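As an aside, the same clipping can be done without a helper function using base R's vectorized pmax() (a stylistic alternative):

```r
# pmax() takes the element-wise maximum against 0,
# so negative values become 0 and others are unchanged
x <- c(-2.5, 0, 3.7)
pmax(x, 0)
```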
# apply the zero function (note the column is named 'predicted')
results$predicted <- sapply(results$predicted, to_zero)
# mean squared error
MSE <- mean ((results$actual-results$predicted)^2) %>% print()
## [1] 3.523079
MSE
## [1] 3.523079
# RMSE
print(MSE^0.5)
## [1] 1.876987
SSE <- sum((results$predicted - results$actual)^2)
# parentheses matter: take the mean of G3 first, then the squared deviations
SST <- sum((mean(Math$G3) - results$actual)^2)
R2 <- 1 - SSE/SST
R2
We first explore the data to conduct a linear regression.
Download the data or just use the supplied csv in the repository. The data has the following features:
- datetime: hourly date + timestamp
- season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday: whether the day is a holiday
- workingday: whether the day is neither a weekend nor a holiday
- weather: 1 = Clear, Few clouds, Partly cloudy; 2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp: temperature in Celsius
- atemp: “feels like” temperature in Celsius
- humidity: relative humidity
- windspeed: wind speed
- casual: number of non-registered user rentals initiated
- registered: number of registered user rentals initiated
- count: number of total rentals
bike <- read.csv('bikeshare.csv')
head(bike)
## datetime season holiday workingday weather temp atemp
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880
## humidity windspeed casual registered count
## 1 81 0.0000 3 13 16
## 2 80 0.0000 8 32 40
## 3 80 0.0000 5 27 32
## 4 75 0.0000 3 10 13
## 5 75 0.0000 0 1 1
## 6 75 6.0032 0 1 1
The purpose of our analysis is to predict the count variable in the data set.
First we create a scatter plot of count vs temp. - count: number of total rentals - temp: temperature in Celsius
# ggplot visualization
library(ggplot2)
ggplot(data=bike,aes(x=temp,y=count))+geom_point(aes(color=temp),alpha=0.1)+geom_smooth()+theme_bw()
## `geom_smooth()` using method = 'gam'
Next, we plot count versus datetime as a scatterplot with a color gradient based on temperature. You’ll need to convert the datetime column into POSIXct before plotting.
library(ggplot2)
ggplot(bike,aes(x=datetime,y=count))+geom_point(aes(color=temp),alpha=0.2)+scale_color_continuous(low="blue",high="orange")+theme_bw()
# reference
# ggplot(bike,aes(datetime,count)) + geom_point(aes(color=temp),alpha=0.5) + scale_color_continuous(low='#55D8CE',high='#FF6E2E') +theme_bw()
Hopefully, we notice two things: (1) there is seasonality in the data across winter and summer, and (2) bike rental counts are increasing in general.
The above might present a problem with using a linear regression model if the data is non-linear. Let’s have a quick overview of pros/cons of linear regression.
Pros: - Simple to explain - Highly interpretable - Model training and prediction are fast - No tuning is required (excluding regularization) - Features don’t need scaling - Can perform well with a small number of observations - Well-understood
Cons: - Assumes a linear relationship between the features and the response - Performance is (generally) not competitive with the best supervised learning methods due to high bias - Can’t automatically learn feature interactions
We’ll keep this in mind as we continue on. Maybe when we learn more algorithms, we can come back to this with some new tools; for now we’ll stick to linear regression.
Next, we will calculate correlation between temp and count.
cor.data <- cor(bike[,c("temp","count")])
cor.data
## temp count
## temp 1.0000000 0.3944536
## count 0.3944536 1.0000000
library(corrgram)
library(corrplot)
# corrplot - quickly visualize correlation
corrplot(cor.data, method='color')
# corrgram - pass into the data frame directly
corrgram(bike)
# help("corrgram")
Let’s explore the season data. Create a boxplot with the y axis indicating count and the x axis being a box for each season.
ggplot(bike,aes(x=factor(season),y=count))+geom_boxplot(aes(colour=factor(season)))
With this visualization, we see that a line cannot capture this non-linear relationship. Also, there appear to be more rentals in winter than in spring; this comes from the overall growth of rental counts, not from the actual season.
A lot of times you’ll need to use domain knowledge and experience to engineer and create new features. Let’s go ahead and engineer some new features from the datetime column.
First, we create an hour column that takes the hour from the datetime column. You’ll probably need to apply some function to the entire datetime column and reassign it.
Hint: time.stamp = bike$datetime[4]; format(time.stamp, "%H")
library(magrittr)
time.stamp <- bike$datetime[4] %>% as.Date() %>% print()
## [1] "2011-01-01"
format(time.stamp,"%H")
## [1] "00"
#as.POSIXct() function - CONVERT to POSIXct()
bike$datetime <- as.POSIXct(bike$datetime)
# sapply()
bike$hour <- sapply(bike$datetime,function(x){format(x,"%H")})
head(bike)
## datetime season holiday workingday weather temp atemp
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880
## humidity windspeed casual registered count hour
## 1 81 0.0000 3 13 16 00
## 2 80 0.0000 8 32 40 01
## 3 80 0.0000 5 27 32 02
## 4 75 0.0000 3 10 13 03
## 5 75 0.0000 0 1 1 04
## 6 75 6.0032 0 1 1 05
Now we create a scatterplot of count versus hour, with a color scale based on temp. Only use bike data where workingday==1. Optional additions: - use the additional layer scale_color_gradientn(colours=c('color1','color2')) - use position=position_jitter(w=1, h=0) inside of geom_point and check out what it does.
library(dplyr)
library(ggplot2)
pl <- ggplot(filter(bike,workingday==1),aes(hour,count))
pl <- pl + geom_point(position=position_jitter(w=1, h=0),aes(color=temp),alpha=0.5)
pl <- pl + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
pl + theme_bw()
Now, we create the same plot for non-working days:
pl <- ggplot(filter(bike,workingday==0),aes(hour,count))
pl <- pl + geom_point(position=position_jitter(w=1, h=0),aes(color=temp),alpha=0.8)
pl <- pl + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
pl + theme_bw()
Working days have peak activity during the morning (~8am) and right after work gets out (~5pm), with some lunchtime activity, while non-working days show a steady rise and fall through the afternoon. Now let’s continue by trying to build a model; we’ll begin by just looking at a single feature.
We use lm() to build a model that predicts count based solely on the temp feature, name it temp.model.
temp.model <- lm(count~temp,data=bike)
summary(temp.model)
##
## Call:
## lm(formula = count ~ temp, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.32 -112.36 -33.36 78.98 741.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0462 4.4394 1.362 0.173
## temp 9.1705 0.2048 44.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared: 0.1556, Adjusted R-squared: 0.1555
## F-statistic: 2006 on 1 and 10884 DF, p-value: < 2.2e-16
We got 6.0462 as the intercept and 9.17 as the temp coefficient.
The next question is: how many bike rentals would we predict if the temperature was 25 degrees Celsius? Calculate this two ways: (1) use the values we just got above; (2) use the predict() function.
6.0462 + 9.17*25
## [1] 235.2962
predict(temp.model,data.frame(temp=c(25)))
## 1
## 235.3097
We use sapply with as.numeric to change the hour column to a column of numeric values.
bike$hour <- sapply(bike$hour,as.numeric)
Finally, we build a model that attempts to predict count based on the following features. Figure out if there is a way to avoid passing/writing all these variables into the lm() function.
lm.model2 <- lm(count~.-casual-registered-datetime-atemp,data=bike)
summary(lm.model2)
##
## Call:
## lm(formula = count ~ . - casual - registered - datetime - atemp,
## data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -324.61 -96.88 -31.01 55.27 688.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91369 8.45147 5.551 2.91e-08 ***
## season 21.70333 1.35409 16.028 < 2e-16 ***
## holiday -10.29914 8.79069 -1.172 0.241
## workingday -0.71781 3.14463 -0.228 0.819
## weather -3.20909 2.49731 -1.285 0.199
## temp 7.01953 0.19135 36.684 < 2e-16 ***
## humidity -2.21174 0.09083 -24.349 < 2e-16 ***
## windspeed 0.20271 0.18639 1.088 0.277
## hour 7.61283 0.21688 35.102 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3339
## F-statistic: 683 on 8 and 10877 DF, p-value: < 2.2e-16
library(coefplot)
coefplot(lm.model2)
A linear model like this one, which uses OLS, won’t be able to take the seasonality of the data into account; it gets thrown off by the growth in the dataset, accidentally attributing it to the winter season instead of recognizing that it’s just overall demand growing. Later on, we’ll see whether other models may be a better fit for this sort of data.
Logistic regression is used as a method for classification. Examples: - spam versus ham emails - loan default (yes/no) - disease diagnosis
We have seen regression problems where we try to predict continuous values. Although the name may be confusing at first, logistic regression allows us to solve classification problems, where we try to predict discrete categories. The convention for binary classification is to have two classes, 0 and 1.
The sigmoid (aka logistic) function takes in any value and outputs a value between 0 and 1. The sigmoid function is defined as below. \[ \phi(z) = \frac{1}{1+e^{-z}} \] This means we can take our linear regression solution and place it into the sigmoid function. We can set a cutoff point at 0.5: anything below it results in class 0, anything above in class 1.
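The sigmoid is easy to write directly in R (a small sketch; the name sigmoid is our own, not a base R function):

```r
# logistic/sigmoid function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)    # exactly 0.5, the cutoff point
sigmoid(4)    # close to 1 -> class 1
sigmoid(-4)   # close to 0 -> class 0
```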
We use the logistic function to output a value ranging from 0 to 1. Based off of this probability we assign a class.
We can use a confusion matrix to evaluate our model. Testing for disease - Type I error: false positive - Type II error: false negative
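A confusion matrix can be tabulated in R with table(); the predicted and actual vectors below are made-up labels for illustration:

```r
# hypothetical binary predictions and true labels
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# rows = predicted class, columns = actual class
cm <- table(predicted, actual)
cm

# off-diagonal cells are the Type I / Type II errors;
# accuracy is the proportion on the diagonal
sum(diag(cm)) / sum(cm)
```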
We use the Titanic data for applying a logistic regression model. - PassengerId - Pclass: passenger class - SibSp: number of siblings/spouses on board - Parch: number of parents/children on board - Fare: ticket fare - Embarked: port of embarkation (e.g., Q = Queenstown)
titanic.train <- read.csv('titanic_train.csv')
head(titanic.train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
str(titanic.train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
dim(titanic.train)
## [1] 891 12
Using the missmap function from the Amelia package, we can see that the Age variable has multiple missing values.
# install.packages('Amelia')
# library(Amelia)
# missmap(titanic.train,main='Missing Map',col=c('yellow','black'),legend = FALSE)
Simple bar charts show the data distribution: - third class > first class > second class - most passengers are in their 20s and 30s (plus children on board)
# Pclass distribution
ggplot(titanic.train,aes(x=Pclass))+geom_bar(aes(fill=factor(Pclass)))
# Sex distribution
ggplot(titanic.train,aes(x=Sex))+geom_bar(aes(fill=factor(Sex)))
# age distribution
ggplot(titanic.train,aes(x=Age))+geom_histogram(bins=20,alpha=0.3,fill='dark blue')
## Warning: Removed 177 rows containing non-finite values (stat_bin).
# SibSp - no Siblings/Spouses
ggplot(titanic.train,aes(SibSp))+geom_bar()
ggplot(titanic.train,aes(x=Fare))+geom_histogram(fill="light blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are two options for dealing with missing values: (1) exclude all rows with missing values — here, dropping 177 rows would lose too much data; (2) impute the missing values, e.g., fill in the average age by passenger class.
We will practice option 2.
pl <- ggplot(titanic.train,aes(x=Pclass,y=Age))
pl <- pl + geom_boxplot(aes(group=Pclass,fill=factor(Pclass),alpha=0.4))
pl
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).
pl + scale_y_continuous(breaks=seq(0,80,by=2))
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).
# Impute a typical age for each passenger class, read off the
# boxplot above (1st class: 37, 2nd class: 29, 3rd class: 24).
impute_age <- function(age,class){
  out <- age
  for (i in 1:length(age)){
    if (is.na(age[i])){
      if (class[i] == 1){
        out[i] <- 37
      } else if (class[i] == 2){
        out[i] <- 29
      } else {
        out[i] <- 24
      }
    }
  }
  return(out)
}
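Rather than hard-coding 37/29/24, the same idea can be driven by the data itself — a non-destructive sketch that computes each class's mean age and fills it in:

```r
# Mean age per passenger class, ignoring NAs
class.means <- tapply(titanic.train$Age, titanic.train$Pclass, mean, na.rm = TRUE)
# Fill each missing Age with the mean for that passenger's class
age.filled <- ifelse(is.na(titanic.train$Age),
                     class.means[as.character(titanic.train$Pclass)],
                     titanic.train$Age)
```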
Continuing from the previous lecture, we fill the average age values into the NA rows of the training set. After this, no row has missing values.
fixed.ages <- impute_age(titanic.train$Age,titanic.train$Pclass)
titanic.train$Age <- fixed.ages
# missmap(titanic.train,main="Imputation Check") # NA values were imputed
Several variables will not be used in the analysis, so we exclude them from the dataset.
str(titanic.train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 24 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
library(dplyr)
titanic.train <- select(titanic.train,-PassengerId,-Name,-Ticket,-Cabin)
str(titanic.train)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 24 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
dim(titanic.train)
## [1] 891 8
titanic.train$Survived <- as.factor(titanic.train$Survived)
titanic.train$Pclass <- as.factor(titanic.train$Pclass)
titanic.train$Parch <- as.factor(titanic.train$Parch)
titanic.train$SibSp <- as.factor(titanic.train$SibSp)
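The four conversions above can equivalently be done in one pass:

```r
# Convert several columns to factors at once
cols <- c('Survived','Pclass','Parch','SibSp')
titanic.train[cols] <- lapply(titanic.train[cols], factor)
```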
str(titanic.train)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 24 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
log.model <- glm(Survived~.,family=binomial(link='logit'),data=titanic.train)
summary(log.model)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"),
## data = titanic.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8158 -0.6134 -0.4138 0.5808 2.4896
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.845e+01 1.660e+03 0.011 0.991134
## Pclass2 -1.079e+00 3.092e-01 -3.490 0.000484 ***
## Pclass3 -2.191e+00 3.161e-01 -6.930 4.20e-12 ***
## Sexmale -2.677e+00 2.040e-01 -13.123 < 2e-16 ***
## Age -3.971e-02 8.758e-03 -4.534 5.79e-06 ***
## SibSp1 8.135e-02 2.245e-01 0.362 0.717133
## SibSp2 -2.897e-01 5.368e-01 -0.540 0.589361
## SibSp3 -2.241e+00 7.202e-01 -3.111 0.001862 **
## SibSp4 -1.675e+00 7.620e-01 -2.198 0.027954 *
## SibSp5 -1.595e+01 9.588e+02 -0.017 0.986731
## SibSp8 -1.607e+01 7.578e+02 -0.021 0.983077
## Parch1 3.741e-01 2.895e-01 1.292 0.196213
## Parch2 3.862e-02 3.824e-01 0.101 0.919560
## Parch3 3.655e-01 1.056e+00 0.346 0.729318
## Parch4 -1.586e+01 1.055e+03 -0.015 0.988007
## Parch5 -1.152e+00 1.172e+00 -0.983 0.325771
## Parch6 -1.643e+01 2.400e+03 -0.007 0.994536
## Fare 2.109e-03 2.490e-03 0.847 0.397036
## EmbarkedC -1.458e+01 1.660e+03 -0.009 0.992995
## EmbarkedQ -1.456e+01 1.660e+03 -0.009 0.993001
## EmbarkedS -1.486e+01 1.660e+03 -0.009 0.992857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 763.41 on 870 degrees of freedom
## AIC: 805.41
##
## Number of Fisher Scoring iterations: 15
library(coefplot)
coefplot(log.model)
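The coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret:

```r
# e.g. exp(-2.677) ~ 0.069 for Sexmale: holding the other predictors
# fixed, males' odds of survival are about 7% of females'.
round(exp(coef(log.model)), 3)
```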
Next, we split the data so we can evaluate predictions on a held-out test set.
library(caTools)
set.seed(101)
split <- sample.split(titanic.train$Survived,SplitRatio = 0.7)
final.train <- subset(titanic.train,split==TRUE)
final.test <- subset(titanic.train,splot=TRUE)
Watch out: splot is a typo for split. subset() silently ignores the unknown argument and returns every row, so this "test" set is actually the full 891-row data frame — the confusion matrix totals below give it away. The intended line is final.test <- subset(titanic.train,split==FALSE).
final.log.model <- glm(Survived~.,family=binomial(link='logit'),data=final.train)
summary(final.log.model)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"),
## data = final.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8288 -0.5607 -0.4096 0.6174 2.4898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.777e+01 2.400e+03 0.007 0.994091
## Pclass2 -1.230e+00 3.814e-01 -3.225 0.001261 **
## Pclass3 -2.160e+00 3.841e-01 -5.624 1.87e-08 ***
## Sexmale -2.660e+00 2.467e-01 -10.782 < 2e-16 ***
## Age -3.831e-02 1.034e-02 -3.705 0.000212 ***
## SibSp1 -2.114e-02 2.755e-01 -0.077 0.938836
## SibSp2 -4.000e-01 6.463e-01 -0.619 0.536028
## SibSp3 -2.324e+00 8.994e-01 -2.584 0.009765 **
## SibSp4 -1.196e+00 8.302e-01 -1.440 0.149839
## SibSp5 -1.603e+01 9.592e+02 -0.017 0.986666
## SibSp8 -1.633e+01 1.004e+03 -0.016 0.987019
## Parch1 7.290e-01 3.545e-01 2.056 0.039771 *
## Parch2 1.406e-01 4.504e-01 0.312 0.754892
## Parch3 7.919e-01 1.229e+00 0.645 0.519226
## Parch4 -1.498e+01 1.552e+03 -0.010 0.992300
## Parch5 -9.772e-03 1.378e+00 -0.007 0.994343
## Parch6 -1.635e+01 2.400e+03 -0.007 0.994563
## Fare 3.128e-03 3.091e-03 1.012 0.311605
## EmbarkedC -1.398e+01 2.400e+03 -0.006 0.995353
## EmbarkedQ -1.387e+01 2.400e+03 -0.006 0.995386
## EmbarkedS -1.431e+01 2.400e+03 -0.006 0.995243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 829.60 on 622 degrees of freedom
## Residual deviance: 530.63 on 602 degrees of freedom
## AIC: 572.63
##
## Number of Fisher Scoring iterations: 15
fitted.probabilities <- predict(final.log.model,final.test,type='response')
fitted.results <- ifelse(fitted.probabilities>0.5,1,0)
misClassError <- mean(fitted.results!=final.test$Survived)
print(1-misClassError)
## [1] 0.8103255
# confusion matrix
table(final.test$Survived,fitted.probabilities>0.5)
##
## FALSE TRUE
## 0 475 74
## 1 95 247
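From the confusion matrix we can also derive precision and recall, not just accuracy:

```r
cm <- table(final.test$Survived, fitted.probabilities > 0.5)
accuracy  <- sum(diag(cm)) / sum(cm)  # (475 + 247) / 891
precision <- cm[2, 2] / sum(cm[, 2])  # TP / (TP + FP) = 247 / 321
recall    <- cm[2, 2] / sum(cm[2, ])  # TP / (TP + FN) = 247 / 342
```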
In this project, we will be working with the UCI adult dataset. We will attempt to predict whether people in the dataset belong to a certain salary class: making <=50K or >50K per year.
Typically, most of your time is spent cleaning data, not running the few lines of code that build your model. This project tries to reflect that by showing different issues that may arise when cleaning data.
We first read in the adult.csv file and set it to a data frame called adult.
adult <- read.csv('adult_sal.csv')
head(adult)
## X age type_employer fnlwgt education education_num marital
## 1 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 3 38 Private 215646 HS-grad 9 Divorced
## 4 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
colnames(adult)
## [1] "X" "age" "type_employer" "fnlwgt"
## [5] "education" "education_num" "marital" "occupation"
## [9] "relationship" "race" "sex" "capital_gain"
## [13] "capital_loss" "hr_per_week" "country" "income"
The index is repeated in the X column, so we clean the dataset by dropping that column, then check the data overview.
library(dplyr)
adult <- select(adult,-X)
# data overview
head(adult)
## age type_employer fnlwgt education education_num marital
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
str(adult)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(adult)
## age type_employer fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education_num marital
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capital_gain capital_loss hr_per_week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## country income
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
Notice that many columns are categorical factors, and many of them have more factor levels than necessary. In this data cleaning section, we clean these columns up by reducing the number of factors.
First, we use table() to check the frequencies in the type_employer column. - Number of null ('?') values in type_employer: 1836 - Two smallest groups: Never-worked, Without-pay
table(adult$type_employer)
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
We combine these two smallest groups into a single group called "Unemployed". There are lots of ways to do this, so feel free to get creative. Hint: it may be helpful to convert these objects to character (as.character()) and then use sapply with a custom function.
unemp <- function(job){
  job <- as.character(job)
  if (job=='Never-worked' | job=='Without-pay'){
    return('Unemployed')
  } else {
    return(job)
  }
}
adult$type_employer <- sapply(adult$type_employer,unemp)
table(adult$type_employer)
##
## ? Federal-gov Local-gov Private
## 1836 960 2093 22696
## Self-emp-inc Self-emp-not-inc State-gov Unemployed
## 1116 2541 1298 21
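The same recode can be done without sapply, using vectorized replacement — a sketch that leaves adult untouched:

```r
emp <- as.character(adult$type_employer)
emp[emp %in% c('Never-worked','Without-pay')] <- 'Unemployed'
table(emp)  # same counts as table(adult$type_employer)
```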
What other columns are suitable for combining? We combine state and local government jobs into a category called SL-gov and self-employed jobs into a category called self-emp.
group_emp <- function(job){
if(job=='local-gov' |job=='State-gov'){
return('SL-gov')
} else if (job=='Self-emp-inc' | job=='Self-emp-not-inc'){
return('self-emp')
}else{
return(job)
}
}
adult$type_employer <- sapply(adult$type_employer,group_emp)
table(adult$type_employer)
##
## ? Federal-gov Local-gov Private self-emp SL-gov
## 1836 960 2093 22696 3657 1298
## Unemployed
## 21
Watch the capitalization: the condition job=='local-gov' never matches the actual level 'Local-gov', which is why Local-gov is still its own group in the table above. The condition should read job=='Local-gov'.
Next, we use table() to look at the marital column.
table(adult$marital)
##
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
Let's reduce this to three groups: Married, Never-Married, and Non-Married.
group_marital <- function(marital){
  marital <- as.character(marital)
  # not married: separated, divorced, widowed
  if(marital=='Separated' | marital=='Divorced' | marital=='Widowed'){
    return('Non-Married')
  # never married
  } else if (marital=='Never-married'){
    return('Never-Married')
  # married
  } else{
    return('Married')
  }
}
adult$marital <- sapply(adult$marital,group_marital)
table(adult$marital)
##
## Married Never-Married Non-Married
## 15417 10683 6461
We check the country column using table()
table(adult$country)
##
## ? Cambodia
## 583 19
## Canada China
## 121 75
## Columbia Cuba
## 59 95
## Dominican-Republic Ecuador
## 70 28
## El-Salvador England
## 106 90
## France Germany
## 29 137
## Greece Guatemala
## 29 64
## Haiti Holand-Netherlands
## 44 1
## Honduras Hong
## 13 20
## Hungary India
## 13 100
## Iran Ireland
## 43 24
## Italy Jamaica
## 73 81
## Japan Laos
## 62 18
## Mexico Nicaragua
## 643 34
## Outlying-US(Guam-USVI-etc) Peru
## 14 31
## Philippines Poland
## 198 60
## Portugal Puerto-Rico
## 37 114
## Scotland South
## 12 80
## Taiwan Thailand
## 51 18
## Trinadad&Tobago United-States
## 19 29170
## Vietnam Yugoslavia
## 67 16
We group these countries together however we see fit; there is flexibility here because there is no single right way to do it — grouping by continent is one possibility. We should, however, be able to reduce the number of groups significantly.
levels(adult$country)
## [1] "?" "Cambodia"
## [3] "Canada" "China"
## [5] "Columbia" "Cuba"
## [7] "Dominican-Republic" "Ecuador"
## [9] "El-Salvador" "England"
## [11] "France" "Germany"
## [13] "Greece" "Guatemala"
## [15] "Haiti" "Holand-Netherlands"
## [17] "Honduras" "Hong"
## [19] "Hungary" "India"
## [21] "Iran" "Ireland"
## [23] "Italy" "Jamaica"
## [25] "Japan" "Laos"
## [27] "Mexico" "Nicaragua"
## [29] "Outlying-US(Guam-USVI-etc)" "Peru"
## [31] "Philippines" "Poland"
## [33] "Portugal" "Puerto-Rico"
## [35] "Scotland" "South"
## [37] "Taiwan" "Thailand"
## [39] "Trinadad&Tobago" "United-States"
## [41] "Vietnam" "Yugoslavia"
Asia <- c('China','Hong','India','Iran','Cambodia','Japan', 'Laos' ,
'Philippines' ,'Vietnam' ,'Taiwan', 'Thailand')
North.America <- c('Canada','United-States','Puerto-Rico' )
Europe <- c('England' ,'France', 'Germany' ,'Greece','Holand-Netherlands','Hungary',
'Ireland','Italy','Poland','Portugal','Scotland','Yugoslavia')
Latin.and.South.America <- c('Columbia','Cuba','Dominican-Republic','Ecuador',
'El-Salvador','Guatemala','Haiti','Honduras',
'Mexico','Nicaragua','Outlying-US(Guam-USVI-etc)','Peru',
'Jamaica','Trinadad&Tobago')
Other <- c('South')
library(magrittr)
group_country <- function(ctry){
if (ctry %in% Asia){
return('Asia')
}else if (ctry %in% North.America){
return('North.America')
}else if (ctry %in% Europe){
return('Europe')
}else if (ctry %in% Latin.and.South.America){
return('Latin.and.South.America')
}else{
return('Other')
}
}
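With dplyr already loaded, an equivalent declarative recode uses case_when() — a non-mutating sketch that reuses the region vectors defined above:

```r
region.check <- dplyr::case_when(
  adult$country %in% Asia                    ~ 'Asia',
  adult$country %in% North.America           ~ 'North.America',
  adult$country %in% Europe                  ~ 'Europe',
  adult$country %in% Latin.and.South.America ~ 'Latin.and.South.America',
  TRUE                                       ~ 'Other'
)
```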
adult$country <- sapply(adult$country,group_country)
table(adult$country)
##
## Asia Europe Latin.and.South.America
## 671 521 1301
## North.America Other
## 29405 663
Once the columns are cleaned, we check str() of adult again and make sure the columns we changed are converted back to factors with factor().
adult$type_employer <- as.factor(adult$type_employer)
adult$country <- sapply(adult$country,factor)
adult$marital <- sapply(adult$marital,factor)
str(adult)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 7 levels "?","Federal-gov",..: 6 5 4 4 4 4 4 5 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Notice how we have data that is missing.
We first convert any cells with a '?' value to NA. Hint: is.na() may be useful here, or we can use brackets with a conditional statement.
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.4.4
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
adult[adult=='?'] <- NA
Using table() on a column with NA values, we confirm those NA values are not shown — table() drops NAs by default.
table(adult$type_emloyer)
## < table of extent 0 >
Careful: type_emloyer is misspelled, so this actually tabulates a NULL column — that is why the table is empty. With the correct name, table(adult$type_employer) lists the usual counts with the '?' entries (now NA) dropped.
adult$type_employer <- sapply(adult$type_employer,factor)
adult$country <- sapply(adult$country,factor)
adult$marital <- sapply(adult$marital,factor)
adult$occupation <- sapply(adult$occupation,factor)
We could have also done something like: adult$type_employer <- factor(adult$type_employer)
We check the distribution of missing values using missmap(), a heatmap that highlights missing (NA) values. This gives a quick glance at how much data is missing — in this case, not a whole lot, relatively speaking. The first call produces a cluttered stack of y-axis labels; we remove them with the second command below. The argument col=c('yellow','black') colors missing cells yellow and observed cells black.
library(Amelia)
missmap(adult)
missmap(adult,y.at=c(1),y.labels = c(''),col=c('yellow','black'))
Next, we use na.omit() to omit NA data from the adult data frame. Note, it really depends on the situation and the data to judge whether or not this is a good decision. We shouldn’t always drop NA values.
adult <- na.omit(adult)
str(adult)
## 'data.frame': 30718 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 6 levels "SL-gov","self-emp",..: 1 2 3 3 3 3 3 2 3 3 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 2 3 3 4 2 5 2 4 2 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1843] 28 62 70 78 107 129 150 155 161 188 ...
## .. ..- attr(*, "names")= chr [1:1843] "28" "62" "70" "78" ...
We again use missmap() to check that all the NA values were in fact dropped.
missmap(adult,y.at=c(1),y.labels=c(""),col=c("yellow","black"))
Although we’ve cleaned the data, we haven’t yet explored it with visualization. We use ggplot2 to create a histogram of ages, colored by income.
str(adult)
## 'data.frame': 30718 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 6 levels "SL-gov","self-emp",..: 1 2 3 3 3 3 3 2 3 3 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 2 3 3 4 2 5 2 4 2 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1843] 28 62 70 78 107 129 150 155 161 188 ...
## .. ..- attr(*, "names")= chr [1:1843] "28" "62" "70" "78" ...
library(ggplot2)
library(dplyr)
ggplot(adult,aes(x=age))+geom_histogram(aes(fill=income),color='black',binwidth=1)+theme_bw()
We plot a histogram of hours worked per week.
ggplot(adult,aes(x=hr_per_week))+geom_histogram(fill="light blue")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We rename the country column to better reflect factor levels.
# lots of ways to do this, could use dplyr as well
names(adult)[names(adult)=="country"] <- "region"
str(adult)
## 'data.frame': 30718 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 6 levels "SL-gov","self-emp",..: 1 2 3 3 3 3 3 2 3 3 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 2 3 3 4 2 5 2 4 2 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ region : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1843] 28 62 70 78 107 129 150 155 161 188 ...
## .. ..- attr(*, "names")= chr [1:1843] "28" "62" "70" "78" ...
We create a barplot of region with the fill color defined by income class. Optional: figure out how to rotate the x-axis text for readability.
ggplot(adult,aes(x=region))+geom_bar(aes(fill=income),color='black')+theme_bw()+theme(axis.text.x=element_text(angle=90,hjust=1))
Now it’s time to build a model to classify people into two groups: above or below 50K in salary.
Refer to the corresponding lecture of ISLR if we are fuzzy on any of this.
Logistic regression is a type of classification model. In classification models, we attempt to predict the outcome of a categorical dependent variable, using one or more independent variables. The independent variables can be categorical or numerical.
Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the argument of the logistic function with a linear combination of the independent variables we intend to use, we arrive at the formula for logistic regression.
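Concretely, with coefficients β and predictors X, the model is:

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
```

which always lies between 0 and 1, so it can be read as the probability of the positive class (here, income >50K).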
We take a quick look at the head() of adult to make sure we have a good overview before building the model.
head(adult)
## age type_employer fnlwgt education education_num marital
## 1 39 SL-gov 77516 Bachelors 13 Never-Married
## 2 50 self-emp 83311 Bachelors 13 Married
## 3 38 Private 215646 HS-grad 9 Non-Married
## 4 53 Private 234721 11th 7 Married
## 5 28 Private 338409 Bachelors 13 Married
## 6 37 Private 284582 Masters 14 Married
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week region income
## 1 40 North.America <=50K
## 2 13 North.America <=50K
## 3 40 North.America <=50K
## 4 40 North.America <=50K
## 5 40 Latin.and.South.America <=50K
## 6 40 North.America <=50K
As a first step, we split the data into a train and test set using the caTools library as done in the previous lectures. Reference previous solution notebooks if we need a refresher.
# caTools library
library(caTools)
# set a random seed
set.seed(101)
# split the sample: randomly assigns a boolean to each row
sample <- sample.split(adult$income,SplitRatio = 0.7)
# training data
train <- subset(adult,sample==TRUE)
# testing data
test <- subset(adult,sample==FALSE)
head(train)
## age type_employer fnlwgt education education_num marital
## 1 39 SL-gov 77516 Bachelors 13 Never-Married
## 2 50 self-emp 83311 Bachelors 13 Married
## 4 53 Private 234721 11th 7 Married
## 5 28 Private 338409 Bachelors 13 Married
## 6 37 Private 284582 Masters 14 Married
## 7 49 Private 160187 9th 5 Married
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## 7 Other-service Not-in-family Black Female 0 0
## hr_per_week region income
## 1 40 North.America <=50K
## 2 13 North.America <=50K
## 4 40 North.America <=50K
## 5 40 Latin.and.South.America <=50K
## 6 40 North.America <=50K
## 7 16 Latin.and.South.America <=50K
head(test)
## age type_employer fnlwgt education education_num marital
## 3 38 Private 215646 HS-grad 9 Non-Married
## 15 40 Private 121772 Assoc-voc 11 Married
## 17 25 self-emp 176756 HS-grad 9 Never-Married
## 18 32 Private 186824 HS-grad 9 Never-Married
## 19 38 Private 28887 11th 7 Married
## 22 54 Private 302146 HS-grad 9 Non-Married
## occupation relationship race sex capital_gain
## 3 Handlers-cleaners Not-in-family White Male 0
## 15 Craft-repair Husband Asian-Pac-Islander Male 0
## 17 Farming-fishing Own-child White Male 0
## 18 Machine-op-inspct Unmarried White Male 0
## 19 Sales Husband White Male 0
## 22 Other-service Unmarried Black Female 0
## capital_loss hr_per_week region income
## 3 0 40 North.America <=50K
## 15 0 40 Other >50K
## 17 0 35 North.America <=50K
## 18 0 40 North.America <=50K
## 19 0 50 North.America <=50K
## 22 0 20 North.America <=50K
glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
glm(formula,family=gaussian,data,weights,subset…)
A typical predictor has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with any duplicates removed.
A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
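As an illustration with columns from the adult data (a sketch — these particular models are not part of the project):

```r
# first*second expands to first + second + first:second
m1 <- glm(income ~ education + sex + education:sex, family = binomial, data = adult)
m2 <- glm(income ~ education * sex,                 family = binomial, data = adult)
# m1 and m2 specify the same model
```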
The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula.
Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.
glm.fit is the workhorse function: it is not normally called directly but can be more efficient where the response vector, design matrix and family have already been calculated.
If more than one of etastart, start and mustart is specified, the first in the list will be used. It is often advisable to supply starting values for a quasi family, and also for families with unusual links such as gaussian(“log”).
All of weights, subset, offset, etastart and mustart are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
For the background to warning messages about ‘fitted probabilities numerically 0 or 1 occurred’ for binomial GLMs, see Venables & Ripley (2002, pp. 197–8).
glm returns an object of class inheriting from “glm” which inherits from the class “lm”. See later in this section. If a non-standard method is used, the object will also inherit from the class (if any) returned by that function.
The function summary (i.e., summary.glm) can be used to obtain or print a summary of the results and the function anova (i.e., anova.glm) to produce an analysis of variance table.
The generic accessor functions coefficients, effects, fitted.values and residuals can be used to extract various useful features of the value returned by glm.
weights extracts a vector of weights, one for each case in the fit (after subsetting and na.action).
An object of class “glm” is a list containing at least the following components: - coefficients - residuals - fitted.values - rank - family - linear.predictors - deviance - aic - null.deviance…
In addition, non-empty fits will have components qr, R and effects relating to the final weighted linear fit.
Objects of class “glm” are normally of class c(“glm”,“lm”), that is inherit from class “lm”, and well-designed methods for class “lm” will be applied to the weighted linear model at the final iteration of IWLS. However, care is needed, as extractor functions for class “glm” such as residuals and weights do not just pick out the component of the fit with the same name.
If a binomial glm model was specified by giving a two-column response, the weights returned by prior.weights are the total numbers of cases (factored by the supplied case weights) and the component y of the result is the proportion of successes.
The argument method serves two purposes. One is to allow the model frame to be recreated with no fitting. The other is to allow the default fitting function glm.fit to be replaced by a function which takes the same arguments and uses a different fitting algorithm. If glm.fit is supplied as a character string it is used to search for a function of that name, starting in the stats namespace.
The class of the object returned by the fitter (if any) will be prepended to the class returned by glm.
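Before applying glm() to the census data, here is a minimal self-contained illustration of the accessor functions on built-in data (a sketch, not part of the analysis below):

```r
# Small logistic regression: transmission type from weight and horsepower.
fit <- glm(am ~ wt + hp, family = binomial(logit), data = mtcars)
coef(fit)                          # coefficients on the log-odds scale
head(fitted(fit))                  # fitted probabilities
head(residuals(fit))               # deviance residuals (the default type)
fit$null.deviance - fit$deviance   # deviance explained vs. intercept-only
```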
We use all the features to train a glm() model on the training data set, passing the argument family=binomial(logit) into the glm function.
adult.model = glm(income ~., family=binomial(logit),data=train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(adult.model)
##
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1169 -0.5168 -0.1966 0.0000 3.6269
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.346e+00 4.307e-01 -17.056 < 2e-16 ***
## age 2.536e-02 2.008e-03 12.629 < 2e-16 ***
## type_employerself-emp -1.266e-02 1.208e-01 -0.105 0.916508
## type_employerPrivate 2.170e-01 1.088e-01 1.994 0.046190 *
## type_employerFederal-gov 6.634e-01 1.500e-01 4.424 9.70e-06 ***
## type_employerLocal-gov -3.216e-02 1.286e-01 -0.250 0.802480
## type_employerUnemployed -1.348e+01 3.688e+02 -0.037 0.970843
## fnlwgt 5.427e-07 2.085e-07 2.603 0.009247 **
## education11th 2.094e-01 2.570e-01 0.815 0.415332
## education12th 3.928e-01 3.410e-01 1.152 0.249272
## education1st-4th -4.591e-01 6.067e-01 -0.757 0.449238
## education5th-6th -8.027e-02 3.980e-01 -0.202 0.840166
## education7th-8th -4.985e-01 2.880e-01 -1.731 0.083453 .
## education9th -1.210e-02 3.191e-01 -0.038 0.969755
## educationAssoc-acdm 1.251e+00 2.165e-01 5.778 7.56e-09 ***
## educationAssoc-voc 1.452e+00 2.084e-01 6.969 3.18e-12 ***
## educationBachelors 2.003e+00 1.938e-01 10.338 < 2e-16 ***
## educationDoctorate 2.869e+00 2.641e-01 10.863 < 2e-16 ***
## educationHS-grad 8.360e-01 1.889e-01 4.427 9.56e-06 ***
## educationMasters 2.348e+00 2.064e-01 11.376 < 2e-16 ***
## educationPreschool -1.879e+01 1.645e+02 -0.114 0.909043
## educationProf-school 2.797e+00 2.468e-01 11.336 < 2e-16 ***
## educationSome-college 1.203e+00 1.915e-01 6.283 3.31e-10 ***
## education_num NA NA NA NA
## maritalMarried 1.280e+00 1.943e-01 6.589 4.43e-11 ***
## maritalNon-Married 5.435e-01 9.953e-02 5.461 4.74e-08 ***
## occupationExec-managerial 7.689e-01 9.095e-02 8.454 < 2e-16 ***
## occupationHandlers-cleaners -7.934e-01 1.726e-01 -4.596 4.30e-06 ***
## occupationProf-specialty 4.964e-01 9.630e-02 5.155 2.54e-07 ***
## occupationOther-service -8.240e-01 1.386e-01 -5.945 2.77e-09 ***
## occupationSales 2.899e-01 9.750e-02 2.974 0.002943 **
## occupationCraft-repair 4.214e-02 9.486e-02 0.444 0.656895
## occupationTransport-moving -1.106e-01 1.190e-01 -0.929 0.352705
## occupationFarming-fishing -1.120e+00 1.619e-01 -6.918 4.58e-12 ***
## occupationMachine-op-inspct -2.190e-01 1.203e-01 -1.821 0.068616 .
## occupationTech-support 6.827e-01 1.325e-01 5.152 2.58e-07 ***
## occupationProtective-serv 6.056e-01 1.494e-01 4.053 5.07e-05 ***
## occupationArmed-Forces -6.250e-01 1.844e+00 -0.339 0.734654
## occupationPriv-house-serv -3.601e+00 1.938e+00 -1.858 0.063212 .
## relationshipNot-in-family -8.662e-01 1.907e-01 -4.541 5.59e-06 ***
## relationshipOther-relative -1.086e+00 2.546e-01 -4.266 1.99e-05 ***
## relationshipOwn-child -1.797e+00 2.357e-01 -7.624 2.46e-14 ***
## relationshipUnmarried -1.030e+00 2.154e-01 -4.782 1.74e-06 ***
## relationshipWife 1.475e+00 1.235e-01 11.948 < 2e-16 ***
## raceAsian-Pac-Islander 6.069e-01 3.207e-01 1.893 0.058403 .
## raceBlack 4.534e-01 2.848e-01 1.592 0.111327
## raceOther 4.236e-02 4.217e-01 0.100 0.920005
## raceWhite 6.599e-01 2.712e-01 2.434 0.014948 *
## sexMale 8.847e-01 9.382e-02 9.430 < 2e-16 ***
## capital_gain 3.193e-04 1.273e-05 25.077 < 2e-16 ***
## capital_loss 6.551e-04 4.562e-05 14.359 < 2e-16 ***
## hr_per_week 2.907e-02 1.988e-03 14.624 < 2e-16 ***
## regionLatin.and.South.America -5.925e-01 1.595e-01 -3.714 0.000204 ***
## regionAsia -6.469e-02 2.044e-01 -0.316 0.751670
## regionOther -4.299e-01 1.651e-01 -2.604 0.009217 **
## regionEurope 4.420e-02 1.552e-01 0.285 0.775868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24138 on 21502 degrees of freedom
## Residual deviance: 14004 on 21448 degrees of freedom
## AIC: 14114
##
## Number of Fisher Scoring iterations: 14
We still have a lot of features, some important, some less so. R comes with a useful function called step(). The step() function iteratively tries to remove predictor variables from the model, deleting variables that do not significantly add to the fit. How does it do this? It uses AIC.
The step() function selects a formula-based model by AIC.
Usage: step(object, scope, scale = 0, direction = c("both", "backward", "forward"), trace = 1, keep = NULL, steps = 1000, k = 2, ...)
step uses add1 and drop1 repeatedly; it will work for any method for which they work, and that is determined by having a valid method for extractAIC. When the additive constant can be chosen so that AIC is equal to Mallows’ Cp, this is done and the tables are labelled appropriately.
The set of models searched is determined by the scope argument. The right-hand-side of its lower component is always included in the model, and right-hand-side of the model is included in the upper component. If scope is a single formula, it specifies the upper component, and the lower model is empty. If scope is missing, the initial model is used as the upper model.
Models specified by scope can be templates to update object as used by update.formula. So using . in a scope formula means ‘what is already there’, with .^2 indicating all interactions of existing terms.
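A hedged sketch of scope templates, again on built-in mtcars data rather than the census data used below:

```r
# Forward selection from an intercept-only model. The lower/upper
# components are update.formula templates; in a scope formula,
# . means 'what is already there'.
base <- lm(mpg ~ 1, data = mtcars)
step(base,
     scope = list(lower = ~ 1, upper = ~ wt + hp + disp),
     direction = "forward", trace = 0)
```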
There is a potential problem in using glm fits with a variable scale, as in that case the deviance is not simply related to the maximized log-likelihood. The “glm” method for function extractAIC makes the appropriate adjustment for a gaussian family, but may need to be amended for other cases. (The binomial and poisson families have fixed scale by default and do not correspond to a particular maximum-likelihood problem for variable scale.)
the stepwise-selected model is returned, with up to two additional components. There is an “anova” component corresponding to the steps taken in the search, as well as a “keep” component if the keep= argument was supplied in the call. The “Resid. Dev” column of the analysis of deviance table refers to a constant minus twice the maximized log likelihood: it will be a deviance only in cases where a saturated model is well-defined (thus excluding lm, aov and survreg fits, for example).
The model fitting must apply the models to the same dataset. This may be a problem if there are missing values and R’s default of na.action = na.omit is used. We suggest you remove the missing values first.
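A minimal precaution matching this advice (train is a hypothetical data frame name here):

```r
# Drop rows with missing values once, up front, so that every model
# step() compares is fitted to exactly the same observations.
train <- na.omit(train)
stopifnot(all(complete.cases(train)))
```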
Calls to the function nobs are used to check that the number of observations involved in the fitting process remains unchanged.
This function differs considerably from the function in S, which uses a number of approximations and does not in general compute the correct AIC.
This is a minimal implementation. Use stepAIC in package MASS for a wider range of object classes.
We call the step() function on the fitted model to create a new model: new.model <- step(model.name).
new.step.model <- step(adult.model)
## Start: AIC=14113.99
## income ~ age + type_employer + fnlwgt + education + education_num +
## marital + occupation + relationship + race + sex + capital_gain +
## capital_loss + hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=14113.99
## income ~ age + type_employer + fnlwgt + education + marital +
## occupation + relationship + race + sex + capital_gain + capital_loss +
## hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 14004 14114
## - fnlwgt 1 14011 14119
## - race 4 14019 14121
## - region 4 14026 14128
## - type_employer 5 14050 14150
## - marital 2 14060 14166
## - sex 1 14096 14204
## - age 1 14165 14273
## - capital_loss 1 14217 14325
## - hr_per_week 1 14222 14330
## - relationship 5 14288 14388
## - occupation 13 14443 14527
## - education 15 14718 14798
## - capital_gain 1 15248 15356
summary(new.step.model)
##
## Call:
## glm(formula = income ~ age + type_employer + fnlwgt + education +
## marital + occupation + relationship + race + sex + capital_gain +
## capital_loss + hr_per_week + region, family = binomial(logit),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1169 -0.5168 -0.1966 0.0000 3.6269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.346e+00 4.307e-01 -17.056 < 2e-16 ***
## age 2.536e-02 2.008e-03 12.629 < 2e-16 ***
## type_employerself-emp -1.266e-02 1.208e-01 -0.105 0.916508
## type_employerPrivate 2.170e-01 1.088e-01 1.994 0.046190 *
## type_employerFederal-gov 6.634e-01 1.500e-01 4.424 9.70e-06 ***
## type_employerLocal-gov -3.216e-02 1.286e-01 -0.250 0.802480
## type_employerUnemployed -1.348e+01 3.688e+02 -0.037 0.970843
## fnlwgt 5.427e-07 2.085e-07 2.603 0.009247 **
## education11th 2.094e-01 2.570e-01 0.815 0.415332
## education12th 3.928e-01 3.410e-01 1.152 0.249272
## education1st-4th -4.591e-01 6.067e-01 -0.757 0.449238
## education5th-6th -8.027e-02 3.980e-01 -0.202 0.840166
## education7th-8th -4.985e-01 2.880e-01 -1.731 0.083453 .
## education9th -1.210e-02 3.191e-01 -0.038 0.969755
## educationAssoc-acdm 1.251e+00 2.165e-01 5.778 7.56e-09 ***
## educationAssoc-voc 1.452e+00 2.084e-01 6.969 3.18e-12 ***
## educationBachelors 2.003e+00 1.938e-01 10.338 < 2e-16 ***
## educationDoctorate 2.869e+00 2.641e-01 10.863 < 2e-16 ***
## educationHS-grad 8.360e-01 1.889e-01 4.427 9.56e-06 ***
## educationMasters 2.348e+00 2.064e-01 11.376 < 2e-16 ***
## educationPreschool -1.879e+01 1.645e+02 -0.114 0.909043
## educationProf-school 2.797e+00 2.468e-01 11.336 < 2e-16 ***
## educationSome-college 1.203e+00 1.915e-01 6.283 3.31e-10 ***
## maritalMarried 1.280e+00 1.943e-01 6.589 4.43e-11 ***
## maritalNon-Married 5.435e-01 9.953e-02 5.461 4.74e-08 ***
## occupationExec-managerial 7.689e-01 9.095e-02 8.454 < 2e-16 ***
## occupationHandlers-cleaners -7.934e-01 1.726e-01 -4.596 4.30e-06 ***
## occupationProf-specialty 4.964e-01 9.630e-02 5.155 2.54e-07 ***
## occupationOther-service -8.240e-01 1.386e-01 -5.945 2.77e-09 ***
## occupationSales 2.899e-01 9.750e-02 2.974 0.002943 **
## occupationCraft-repair 4.214e-02 9.486e-02 0.444 0.656895
## occupationTransport-moving -1.106e-01 1.190e-01 -0.929 0.352705
## occupationFarming-fishing -1.120e+00 1.619e-01 -6.918 4.58e-12 ***
## occupationMachine-op-inspct -2.190e-01 1.203e-01 -1.821 0.068616 .
## occupationTech-support 6.827e-01 1.325e-01 5.152 2.58e-07 ***
## occupationProtective-serv 6.056e-01 1.494e-01 4.053 5.07e-05 ***
## occupationArmed-Forces -6.250e-01 1.844e+00 -0.339 0.734654
## occupationPriv-house-serv -3.601e+00 1.938e+00 -1.858 0.063212 .
## relationshipNot-in-family -8.662e-01 1.907e-01 -4.541 5.59e-06 ***
## relationshipOther-relative -1.086e+00 2.546e-01 -4.266 1.99e-05 ***
## relationshipOwn-child -1.797e+00 2.357e-01 -7.624 2.46e-14 ***
## relationshipUnmarried -1.030e+00 2.154e-01 -4.782 1.74e-06 ***
## relationshipWife 1.475e+00 1.235e-01 11.948 < 2e-16 ***
## raceAsian-Pac-Islander 6.069e-01 3.207e-01 1.893 0.058403 .
## raceBlack 4.534e-01 2.848e-01 1.592 0.111327
## raceOther 4.236e-02 4.217e-01 0.100 0.920005
## raceWhite 6.599e-01 2.712e-01 2.434 0.014948 *
## sexMale 8.847e-01 9.382e-02 9.430 < 2e-16 ***
## capital_gain 3.193e-04 1.273e-05 25.077 < 2e-16 ***
## capital_loss 6.551e-04 4.562e-05 14.359 < 2e-16 ***
## hr_per_week 2.907e-02 1.988e-03 14.624 < 2e-16 ***
## regionLatin.and.South.America -5.925e-01 1.595e-01 -3.714 0.000204 ***
## regionAsia -6.469e-02 2.044e-01 -0.316 0.751670
## regionOther -4.299e-01 1.651e-01 -2.604 0.009217 **
## regionEurope 4.420e-02 1.552e-01 0.285 0.775868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24138 on 21502 degrees of freedom
## Residual deviance: 14004 on 21448 degrees of freedom
## AIC: 14114
##
## Number of Fisher Scoring iterations: 14
We notice that the step() function kept all the features used previously. While we used the AIC criterion to compare models, there are other criteria we could have used. If you want, you can read about the variance inflation factor (VIF) and the vif() function to explore other comparison criteria. In the meantime, let’s continue on and see how well our model performs against the test set.
We create a confusion matrix by calling the predict function with type="response" and comparing the predicted probabilities against a 0.5 cutoff.
test$predicted.income=predict(adult.model,newdata=test,type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
table(test$income,test$predicted.income>0.5)
##
## FALSE TRUE
## <=50K 6372 548
## >50K 873 1422
The accuracy of the model is
(6372+1422) / (6372+1422+548+873)
## [1] 0.8457949
Other performance measures, such as recall and precision, are:
# recall (computed here for the <=50K class)
6372/(6372+548)
## [1] 0.9208092
# precision (computed here for the <=50K class)
6372 / (6372+873)
## [1] 0.8795031
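The same figures can be read off the confusion matrix programmatically. A sketch, assuming the 2x2 table layout shown above with <=50K treated as the positive class:

```r
# Rebuild the confusion matrix from the counts printed above.
cm <- matrix(c(6372, 873, 548, 1422), nrow = 2,
             dimnames = list(actual = c("<=50K", ">50K"),
                             predicted = c("FALSE", "TRUE")))
accuracy  <- sum(diag(cm)) / sum(cm)                     # correct / total
recall    <- cm["<=50K", "FALSE"] / sum(cm["<=50K", ])   # of actual <=50K, how many caught
precision <- cm["<=50K", "FALSE"] / sum(cm[, "FALSE"])   # of predicted <=50K, how many right
round(c(accuracy = accuracy, recall = recall, precision = precision), 4)
```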
Reference: ISLR Chapter 4
K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through an example. Imagine we have some imaginary data on dogs and horses, with heights and weights.
Training algorithm: 1. Store all the data
Prediction algorithm: 1. Calculate the distance from x to all points in the data 2. Sort the points in the data by increasing distance from x 3. Predict the majority label of the “k” closest points
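The three prediction steps can be sketched directly in base R; this is a toy hand-rolled version for illustration, not the knn() function from the class package used later:

```r
# Minimal KNN classifier: Euclidean distance, majority vote of k nearest.
knn_predict <- function(train_x, train_y, x, k = 3) {
  d <- sqrt(rowSums((train_x - matrix(x, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))  # step 1: distances
  nearest <- order(d)[1:k]                                # step 2: sort by distance
  names(which.max(table(train_y[nearest])))               # step 3: majority label
}
# Toy data: two clusters of (height, weight) points.
X <- rbind(c(1, 1), c(1, 2), c(8, 8), c(9, 8))
y <- c("dog", "dog", "horse", "horse")
knn_predict(X, y, c(1.5, 1.5), k = 3)   # "dog"
```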
Choosing a K will affect the class a new point is assigned to.
Pros - very simple - training is trivial - works with any number of classes - easy to add more data - few parameters: K and the distance metric
Cons - high prediction cost - not good with high-dimensional data - categorical features don’t work well
Dataset: Caravan from ISLR package
# install.packages("ISLR")
library(ISLR)
summary(Caravan)
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF
## Min. : 1.00 Min. : 1.000 Min. :1.000 Min. :1.000
## 1st Qu.:10.00 1st Qu.: 1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :30.00 Median : 1.000 Median :3.000 Median :3.000
## Mean :24.25 Mean : 1.111 Mean :2.679 Mean :2.991
## 3rd Qu.:35.00 3rd Qu.: 1.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :41.00 Max. :10.000 Max. :5.000 Max. :6.000
## MOSHOOFD MGODRK MGODPR MGODOV
## Min. : 1.000 Min. :0.0000 Min. :0.000 Min. :0.00
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:0.00
## Median : 7.000 Median :0.0000 Median :5.000 Median :1.00
## Mean : 5.774 Mean :0.6965 Mean :4.627 Mean :1.07
## 3rd Qu.: 8.000 3rd Qu.:1.0000 3rd Qu.:6.000 3rd Qu.:2.00
## Max. :10.000 Max. :9.0000 Max. :9.000 Max. :5.00
## MGODGE MRELGE MRELSA MRELOV
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.00
## 1st Qu.:2.000 1st Qu.:5.000 1st Qu.:0.0000 1st Qu.:1.00
## Median :3.000 Median :6.000 Median :1.0000 Median :2.00
## Mean :3.259 Mean :6.183 Mean :0.8835 Mean :2.29
## 3rd Qu.:4.000 3rd Qu.:7.000 3rd Qu.:1.0000 3rd Qu.:3.00
## Max. :9.000 Max. :9.000 Max. :7.0000 Max. :9.00
## MFALLEEN MFGEKIND MFWEKIND MOPLHOOG
## Min. :0.000 Min. :0.00 Min. :0.0 Min. :0.000
## 1st Qu.:0.000 1st Qu.:2.00 1st Qu.:3.0 1st Qu.:0.000
## Median :2.000 Median :3.00 Median :4.0 Median :1.000
## Mean :1.888 Mean :3.23 Mean :4.3 Mean :1.461
## 3rd Qu.:3.000 3rd Qu.:4.00 3rd Qu.:6.0 3rd Qu.:2.000
## Max. :9.000 Max. :9.00 Max. :9.0 Max. :9.000
## MOPLMIDD MOPLLAAG MBERHOOG MBERZELF
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:0.000 1st Qu.:0.000
## Median :3.000 Median :5.000 Median :2.000 Median :0.000
## Mean :3.351 Mean :4.572 Mean :1.895 Mean :0.398
## 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :5.000
## MBERBOER MBERMIDD MBERARBG MBERARBO
## Min. :0.0000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:1.000
## Median :0.0000 Median :3.000 Median :2.00 Median :2.000
## Mean :0.5223 Mean :2.899 Mean :2.22 Mean :2.306
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :9.0000 Max. :9.000 Max. :9.00 Max. :9.000
## MSKA MSKB1 MSKB2 MSKC
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :1.000 Median :2.000 Median :2.000 Median :4.000
## Mean :1.621 Mean :1.607 Mean :2.203 Mean :3.759
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
## MSKD MHHUUR MHKOOP MAUT1
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:5.00
## Median :1.000 Median :4.000 Median :5.000 Median :6.00
## Mean :1.067 Mean :4.237 Mean :4.772 Mean :6.04
## 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.00
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.00
## MAUT2 MAUT0 MZFONDS MZPART
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:5.000 1st Qu.:1.000
## Median :1.000 Median :2.000 Median :7.000 Median :2.000
## Mean :1.316 Mean :1.959 Mean :6.277 Mean :2.729
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:8.000 3rd Qu.:4.000
## Max. :7.000 Max. :9.000 Max. :9.000 Max. :9.000
## MINKM30 MINK3045 MINK4575 MINK7512
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.000 Median :4.000 Median :3.000 Median :0.0000
## Mean :2.574 Mean :3.536 Mean :2.731 Mean :0.7961
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.0000
## MINK123M MINKGEM MKOOPKLA PWAPART
## Min. :0.0000 Min. :0.000 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:0.0000
## Median :0.0000 Median :4.000 Median :4.000 Median :0.0000
## Mean :0.2027 Mean :3.784 Mean :4.236 Mean :0.7712
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.:2.0000
## Max. :9.0000 Max. :9.000 Max. :8.000 Max. :3.0000
## PWABEDR PWALAND PPERSAUT PBESAUT
## Min. :0.00000 Min. :0.00000 Min. :0.00 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :5.00 Median :0.00000
## Mean :0.04002 Mean :0.07162 Mean :2.97 Mean :0.04827
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:6.00 3rd Qu.:0.00000
## Max. :6.00000 Max. :4.00000 Max. :8.00 Max. :7.00000
## PMOTSCO PVRAAUT PAANHANG PTRACTOR
## Min. :0.0000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.1754 Mean :0.009447 Mean :0.02096 Mean :0.09258
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.0000 Max. :9.000000 Max. :5.00000 Max. :6.00000
## PWERKT PBROM PLEVEN PPERSONG
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.000 Median :0.0000 Median :0.00000
## Mean :0.01305 Mean :0.215 Mean :0.1948 Mean :0.01374
## 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :6.00000 Max. :6.000 Max. :9.0000 Max. :6.00000
## PGEZONG PWAOREG PBRAND PZEILPL
## Min. :0.00000 Min. :0.00000 Min. :0.000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000000
## Median :0.00000 Median :0.00000 Median :2.000 Median :0.0000000
## Mean :0.01529 Mean :0.02353 Mean :1.828 Mean :0.0008588
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.:0.0000000
## Max. :3.00000 Max. :7.00000 Max. :8.000 Max. :3.0000000
## PPLEZIER PFIETS PINBOED PBYSTAND
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01889 Mean :0.02525 Mean :0.01563 Mean :0.04758
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :6.00000 Max. :1.00000 Max. :6.00000 Max. :5.00000
## AWAPART AWABEDR AWALAND APERSAUT
## Min. :0.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :0.403 Mean :0.01477 Mean :0.02061 Mean :0.5622
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :2.000 Max. :5.00000 Max. :1.00000 Max. :7.0000
## ABESAUT AMOTSCO AVRAAUT AAANHANG
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.01048 Mean :0.04105 Mean :0.002233 Mean :0.01254
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :4.00000 Max. :8.00000 Max. :3.000000 Max. :3.00000
## ATRACTOR AWERKT ABROM ALEVEN
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.03367 Mean :0.006183 Mean :0.07042 Mean :0.07661
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :4.00000 Max. :6.000000 Max. :2.00000 Max. :8.00000
## APERSONG AGEZONG AWAOREG ABRAND
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.000000 Median :0.000000 Median :1.0000
## Mean :0.005325 Mean :0.006527 Mean :0.004638 Mean :0.5701
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.000000 Max. :1.000000 Max. :2.000000 Max. :7.0000
## AZEILPL APLEZIER AFIETS
## Min. :0.0000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.0000000 Median :0.000000 Median :0.00000
## Mean :0.0005153 Mean :0.006012 Mean :0.03178
## 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :2.000000 Max. :3.00000
## AINBOED ABYSTAND Purchase
## Min. :0.000000 Min. :0.00000 No :5474
## 1st Qu.:0.000000 1st Qu.:0.00000 Yes: 348
## Median :0.000000 Median :0.00000
## Mean :0.007901 Mean :0.01426
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :2.000000 Max. :2.00000
any(is.na(Caravan))
## [1] FALSE
library(Amelia)
missmap(Caravan)
head(Caravan)
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE
## 1 33 1 3 2 8 0 5 1 3
## 2 37 1 2 2 8 1 4 1 4
## 3 37 1 2 2 8 0 4 2 4
## 4 9 1 3 3 3 2 3 2 4
## 5 40 1 4 2 10 1 4 1 4
## 6 23 1 2 1 5 0 5 0 5
## MRELGE MRELSA MRELOV MFALLEEN MFGEKIND MFWEKIND MOPLHOOG MOPLMIDD
## 1 7 0 2 1 2 6 1 2
## 2 6 2 2 0 4 5 0 5
## 3 3 2 4 4 4 2 0 5
## 4 5 2 2 2 3 4 3 4
## 5 7 1 2 2 4 4 5 4
## 6 0 6 3 3 5 2 0 5
## MOPLLAAG MBERHOOG MBERZELF MBERBOER MBERMIDD MBERARBG MBERARBO MSKA
## 1 7 1 0 1 2 5 2 1
## 2 4 0 0 0 5 0 4 0
## 3 4 0 0 0 7 0 2 0
## 4 2 4 0 0 3 1 2 3
## 5 0 0 5 4 0 0 0 9
## 6 4 2 0 0 4 2 2 2
## MSKB1 MSKB2 MSKC MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS MZPART
## 1 1 2 6 1 1 8 8 0 1 8 1
## 2 2 3 5 0 2 7 7 1 2 6 3
## 3 5 0 4 0 7 2 7 0 2 9 0
## 4 2 1 4 0 5 4 9 0 0 7 2
## 5 0 0 0 0 4 5 6 2 1 5 4
## 6 2 2 4 2 9 0 5 3 3 9 0
## MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM MKOOPKLA PWAPART
## 1 0 4 5 0 0 4 3 0
## 2 2 0 5 2 0 5 4 2
## 3 4 5 0 0 0 3 4 2
## 4 1 5 3 0 0 4 4 0
## 5 0 0 9 0 0 6 3 0
## 6 5 2 3 0 0 3 3 0
## PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO PVRAAUT PAANHANG PTRACTOR
## 1 0 0 6 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 6 0 0 0 0 0
## 4 0 0 6 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 6 0 0 0 0 0
## PWERKT PBROM PLEVEN PPERSONG PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER
## 1 0 0 0 0 0 0 5 0 0
## 2 0 0 0 0 0 0 2 0 0
## 3 0 0 0 0 0 0 2 0 0
## 4 0 0 0 0 0 0 2 0 0
## 5 0 0 0 0 0 0 6 0 0
## 6 0 0 0 0 0 0 0 0 0
## PFIETS PINBOED PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## 1 0 0 0 0 0 0 1 0 0
## 2 0 0 0 2 0 0 0 0 0
## 3 0 0 0 1 0 0 1 0 0
## 4 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 1 0 0
## AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG AGEZONG AWAOREG
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND Purchase
## 1 1 0 0 0 0 0 No
## 2 1 0 0 0 0 0 No
## 3 1 0 0 0 0 0 No
## 4 1 0 0 0 0 0 No
## 5 1 0 0 0 0 0 No
## 6 0 0 0 0 0 0 No
Next, we look at the variance of individual variables; KNN is distance-based, so features on larger scales dominate the distance calculation unless we standardize.
var(Caravan[,1])
## [1] 165.0378
var(Caravan[,2])
## [1] 0.1647078
purchase <- Caravan[,86]
standardized.Caravan <- scale(Caravan[,-86])
print(var(standardized.Caravan[,1]))
## [1] 1
print(var(standardized.Caravan[,2]))
## [1] 1
# Train test split
## Test set
test.index <- 1:1000
test.data <- standardized.Caravan[test.index,]
test.purchase <- purchase[test.index]
# Train set
train.data <- standardized.Caravan[-test.index,]
train.purchase <- purchase[-test.index]
Based on the above training/test set, we fit a KNN model.
library(class)
## KNN model
set.seed(101)
predicted.purchase <- knn(train.data,test.data,train.purchase,k=1)
head(predicted.purchase)
## [1] No No No No No No
## Levels: No Yes
# misclass
mean(test.purchase != predicted.purchase)
## [1] 0.116
misclass.error <- mean(test.purchase!=predicted.purchase)
print(misclass.error)
## [1] 0.116
# k=3 case misclassification rate = 0.073
predicted.purchase <- knn(train.data,test.data,train.purchase,k=3)
mean(test.purchase != predicted.purchase)
## [1] 0.073
# k=5 case
set.seed(101)
predicted.purchase <- knn(train.data,test.data,train.purchase,k=5)
mean(test.purchase != predicted.purchase)
## [1] 0.066
#### Choosing a K value
predicted.purchase = NULL
error.rate = NULL
for(i in 1:20){
set.seed(101)
predicted.purchase = knn(train.data,test.data,train.purchase,k=i)
error.rate[i] = mean(test.purchase != predicted.purchase)
}
print(error.rate)
## [1] 0.116 0.107 0.074 0.070 0.066 0.064 0.062 0.061 0.058 0.058 0.059
## [12] 0.058 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059
# visualize K elbow method
library(ggplot2)
k.values <- 1:20
error.df <- data.frame(error.rate,k.values)
ggplot(error.df,aes(k.values,error.rate))+geom_point()+geom_line(lty='dotted',color='red')
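Instead of eyeballing the elbow, the best k on the test set can be picked directly; a sketch assuming the error.rate and k.values vectors from above:

```r
# First k achieving the minimum test error (k = 9 in the run above,
# where 0.058 first appears; several larger k tie with it).
best.k <- k.values[which.min(error.rate)]
best.k
min(error.rate)
```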
Since KNN is such a simple algorithm, we will use this project as a short exercise to test the implementation of KNN. For this project, just follow along with the bolded instructions. It should be very simple, and at the end we will have an additional optional bonus project.
We will use the famous iris data set for this project. It’s a small data set with flower features that can be used to attempt to predict the species of an iris flower.
First, we get the iris data set and check the head of the data frame. (The iris data set ships with base R’s datasets package; the ISLR library is loaded here but is not actually required for it.)
library(ISLR)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
In this case, the iris data set has all its features on the same order of magnitude, but it’s good practice (especially with KNN) to standardize features in the data. Let’s go ahead and do this even though it’s not strictly necessary for this data.
Use scale() to standardize the feature columns of the iris data set, and save this standardized version of the data as a new variable.
var(iris[,1])
## [1] 0.6856935
var(iris[,2])
## [1] 0.1899794
var(iris[,3])
## [1] 3.116278
var(iris[,4])
## [1] 0.5810063
species <- iris[,5]
standardized.iris <- scale(iris[,-5])
var(standardized.iris[,1])
## [1] 1
var(standardized.iris[,2])
## [1] 1
var(standardized.iris[,3])
## [1] 1
var(standardized.iris[,4])
## [1] 1
head(standardized.iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.8976739 1.01560199 -1.335752 -1.311052
## [2,] -1.1392005 -0.13153881 -1.335752 -1.311052
## [3,] -1.3807271 0.32731751 -1.392399 -1.311052
## [4,] -1.5014904 0.09788935 -1.279104 -1.311052
## [5,] -1.0184372 1.24503015 -1.335752 -1.311052
## [6,] -0.5353840 1.93331463 -1.165809 -1.048667
dim(standardized.iris)
## [1] 150 4
Then, we split the standardized data into train and test sets: first with a simple random index split, and then with the caTools library’s sample.split().
library(magrittr)
iris.train <- sample(150,105)
iris.test <- (-iris.train)
iris.train.df <- standardized.iris[iris.train,] %>% as.data.frame()
iris.test.df <- standardized.iris[iris.test,] %>% as.data.frame()
head(iris.test.df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -1.01843718 1.24503015 -1.335752 -1.311052
## 2 -1.13920048 0.09788935 -1.279104 -1.442245
## 3 -1.25996379 0.78617383 -1.222456 -1.311052
## 4 -0.05233076 2.16274279 -1.449047 -1.311052
## 5 -0.17309407 3.08045544 -1.279104 -1.048667
## 6 -0.89767388 1.01560199 -1.335752 -1.179859
library(caTools)
sample <- sample.split(iris$Species,SplitRatio=0.7)
train <- subset(iris,sample==TRUE)
test <- subset(iris,sample==FALSE) # note '==': a single '=' would pass sample as an argument and return all 150 rows
We call the class library, and use the knn function to predict species of the test set, with k=1.
# KNN: class package
library(class)
# knn model
set.seed(101)
predicted.species <- knn(train[1:4],test[1:4],train$Species,k=1)
predicted.species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor virginica versicolor
## [73] virginica versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica versicolor virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica versicolor virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
# misclassification rate
mean(test$Species!=predicted.species)
## [1] 0.02666667
The data set is quite small, but let’s still search for the optimal k value.
predicted.species <- NULL
error.rate <- NULL
for(i in 1:10){
set.seed(101)
predicted.species = knn(train[,1:4],test[,1:4],train$Species,k=i)
error.rate[i]=mean(test$Species!= predicted.species)
}
print(error.rate)
## [1] 0.02666667 0.03333333 0.03333333 0.03333333 0.02666667 0.02666667
## [7] 0.01333333 0.02000000 0.02000000 0.02666667
# visualize K elbow method
library(ggplot2)
k.values <- 1:10
error.df <- data.frame(error.rate,k.values)
ggplot(error.df,aes(k.values,error.rate))+geom_point()+geom_line(lty='dotted',color='red')
We notice that the error drops to its lowest for k values between roughly 2 and 6 (the exact values depend on the random split, so the result may not match a reference solution). It then begins to jump back up again, which is due to how small the data set is: at k=10, we begin to approach using roughly 10% of the data for each prediction, which is quite large.
Reference: ISLR Chapter 8
Decision tree terminology:
- root: the node that performs the first split
- nodes: split on the value of a certain attribute
- leaves: terminal nodes that predict the outcome
Entropy and information gain are the mathematical methods of choosing the best split. Entropy: \[H(S) = - \Sigma_i p_i(S) \log_2 p_i(S)\] Information gain: \[IG(S,A) = H(S) - \Sigma_{v} \frac{|S_v|}{|S|}H(S_v)\] where \(S_v\) is the subset of S for which attribute A takes the value v.
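As a small sketch of these two formulas in R (the 9-yes/5-no labels and the two-valued attribute below are made-up illustration values, not from any data set in these notes):

```r
# entropy of a vector of class labels: H(S) = -sum_i p_i * log2(p_i)
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# information gain from splitting the labels by an attribute:
# IG(S,A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)
info.gain <- function(labels, split.attr) {
  weighted <- sum(sapply(split(labels, split.attr), function(s) {
    (length(s) / length(labels)) * entropy(s)
  }))
  entropy(labels) - weighted
}

# made-up example: 9 "yes" / 5 "no", split by a two-valued attribute
labels <- c(rep("yes", 9), rep("no", 5))
wind   <- c(rep("weak", 6), rep("strong", 3), rep("weak", 2), rep("strong", 3))
round(entropy(labels), 3)         # ~0.94
round(info.gain(labels, wind), 3) # ~0.048
```

A split with higher information gain reduces entropy more, which is exactly what the tree-growing algorithm looks for.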
To improve performance, we can use many trees with a random sample of features chosen at each split:
- a new random sample of m features is chosen for every single tree at every single split
- for classification, m is typically chosen to be the square root of p, the total number of features
Suppose there is one very strong feature in the data set. When using “bagged” trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.
- Averaging highly correlated quantities does not significantly reduce variance.
- By randomly leaving out candidate features from each split, random forest decorrelates the trees, so that the averaging process can reduce the variance of the resulting model.
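As a quick arithmetic check of that rule: R's randomForest() uses m = floor(sqrt(p)) by default for classification, and the kyphosis data used below has p = 3 predictors (Age, Number, Start):

```r
p <- 3                   # kyphosis predictors: Age, Number, Start
mtry <- floor(sqrt(p))   # randomForest's default m for classification
mtry  # 1
```

This matches the "No. of variables tried at each split: 1" line in the randomForest output further down.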
Decision trees: rpart package
library(rpart)
str(kyphosis)
## 'data.frame': 81 obs. of 4 variables:
## $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
## $ Age : int 71 158 128 2 1 1 61 37 113 59 ...
## $ Number : int 3 3 4 5 4 2 2 3 2 6 ...
## $ Start : int 5 14 5 1 15 16 17 16 16 12 ...
head(kyphosis)
## Kyphosis Age Number Start
## 1 absent 71 3 5
## 2 absent 158 3 14
## 3 present 128 4 5
## 4 absent 2 5 1
## 5 absent 1 4 15
## 6 absent 1 2 16
# model application
tree <- rpart(Kyphosis~.,data=kyphosis,method='class')
printcp(tree)
##
## Classification tree:
## rpart(formula = Kyphosis ~ ., data = kyphosis, method = "class")
##
## Variables actually used in tree construction:
## [1] Age Start
##
## Root node error: 17/81 = 0.20988
##
## n= 81
##
## CP nsplit rel error xerror xstd
## 1 0.176471 0 1.00000 1.0000 0.21559
## 2 0.019608 1 0.82353 1.0588 0.22010
## 3 0.010000 4 0.76471 1.0588 0.22010
library(magrittr)
plot(tree,uniform=T, main='Kyphosis Tree')
text(tree,use.n=T,all=T)
There are multiple functions available for a decision tree model:
- printcp: display the cp table
- plotcp: plot cross-validation results
- rsq.rpart: plot approximate R-squared and relative error for different splits (labels are only appropriate for the anova method)
- print: print results
- summary: detailed results including surrogate splits
- plot: plot the decision tree
- text: label the decision tree plot
- post: create a postscript plot of the decision tree
The plot above can be improved with the rpart.plot package.
library(rpart.plot)
prp(tree)
Random forest improves predictive accuracy by generating a large number of bootstrapped trees (each based on a random sample of variables), classifying a case using every tree in this new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (a majority vote in classification).
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf.model <- randomForest(Kyphosis~.,data=kyphosis)
print(rf.model)
##
## Call:
## randomForest(formula = Kyphosis ~ ., data = kyphosis)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 18.52%
## Confusion matrix:
## absent present class.error
## absent 61 3 0.0468750
## present 12 5 0.7058824
rf.model$predicted
## 1 2 3 4 5 6 7 8 9
## absent absent present absent absent absent absent absent absent
## 10 11 12 13 14 15 16 17 18
## absent absent absent absent absent absent absent absent absent
## 19 20 21 22 23 24 25 26 27
## absent absent absent present absent present absent absent absent
## 28 29 30 31 32 33 34 35 36
## absent absent absent absent absent absent absent absent absent
## 37 38 39 40 41 42 43 44 45
## absent absent absent absent absent absent present absent absent
## 46 47 48 49 50 51 52 53 54
## absent absent absent absent absent absent absent absent absent
## 55 56 57 58 59 60 61 62 63
## absent absent absent present present absent absent present absent
## 64 65 66 67 68 69 70 71 72
## absent absent absent absent absent absent absent absent absent
## 73 74 75 76 77 78 79 80 81
## absent absent absent absent absent absent absent present absent
## Levels: absent present
rf.model$ntree
## [1] 500
rf.model$confusion
## absent present class.error
## absent 61 3 0.0468750
## present 12 5 0.7058824
For this project, we will be exploring the use of tree methods to classify schools as Private or Public based off their features. Let’s start by getting the data which is included in the ISLR library, the College data frame.
A data frame with 777 observations on the following 18 variables:
- Private: a factor with levels No and Yes indicating private or public university
- Apps: number of applications received
- Accept: number of applications accepted
- Enroll: number of new students enrolled
- Top10perc: pct. new students from top 10% of H.S. class
- Top25perc: pct. new students from top 25% of H.S. class
- F.Undergrad: number of full-time undergraduates
- P.Undergrad: number of part-time undergraduates
- Outstate: out-of-state tuition
- Room.Board: room and board costs
- Books: estimated book costs
- Personal: estimated personal spending
- PhD: pct. of faculty with Ph.D.’s
- Terminal: pct. of faculty with terminal degree
- S.F.Ratio: student/faculty ratio
- perc.alumni: pct. alumni who donate
- Expend: instructional expenditure per student
- Grad.Rate: graduation rate
We first call the ISLR library and check the head of College (a built-in data frame with ISLR, use data() to check this). Then reassign College to a dataframe called df.
library(ISLR)
library(magrittr)
df <- College
head(df,10)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Albertus Magnus College Yes 353 340 103 17
## Albion College Yes 1899 1720 489 37
## Albright College Yes 1038 839 227 30
## Alderson-Broaddus College Yes 582 498 172 21
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Albertus Magnus College 45 416 230 13290
## Albion College 68 1594 32 13868
## Albright College 63 973 306 15595
## Alderson-Broaddus College 44 799 78 10468
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## Albertus Magnus College 5720 500 1500 90 93
## Albion College 4826 450 850 89 100
## Albright College 4400 300 500 79 84
## Alderson-Broaddus College 3380 660 1800 40 41
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
## Albertus Magnus College 11.5 26 8861 63
## Albion College 13.7 37 11487 73
## Albright College 11.3 23 11644 80
## Alderson-Broaddus College 11.5 15 8991 52
We create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column.
library(ggplot2)
ggplot(df,aes(x=Room.Board,y=Grad.Rate))+geom_point(aes(color=Private))
Then, we also create a histogram of full time undergrad students, color by Private.
ggplot(df,aes(x=F.Undergrad))+geom_histogram(aes(fill=Private),color='light grey')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Create a histogram of Grad.Rate colored by Private,
ggplot(df,aes(x=Grad.Rate))+geom_histogram(aes(fill=Private),color="light gray")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df,aes(x=Grad.Rate))+geom_histogram(aes(fill=Private),color="light gray",bins=50)
What college had a Graduation rate of above 100%?
library(magrittr)
head(df)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
# subset() function
subset(df,Grad.Rate>100)
## Private Apps Accept Enroll Top10perc Top25perc
## Cazenovia College Yes 3847 3433 527 9 35
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Cazenovia College 1010 12 9384 4840 600
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Cazenovia College 500 22 47 14.3 20 7697
## Grad.Rate
## Cazenovia College 118
Since we want to cap the college’s graduation rate at 100%, we simply re-assign the value in the data frame.
df['Cazenovia College','Grad.Rate'] <- 100 # row names are case-sensitive: 'Cazenovia college' would silently add a new row
Next, we split the data into training and testing sets (70/30) using the caTools library.
library(caTools)
set.seed(101)
sample = sample.split(df$Private,SplitRatio=.70)
train=subset(df,sample==TRUE)
test=subset(df,sample==FALSE)
We use the rpart library to build a decision tree to predict whether or not a school is Private. Remember to build the tree only from the training data.
library(rpart)
tree <- rpart(Private~.,method='class',data=train)
We use predict() to predict the Private label on the test data, and check the head of the predicted values. Notice that we get two columns with the class probabilities.
tree.preds <- predict(tree,test)
head(tree.preds)
## No Yes
## Adrian College 0.003311258 0.9966887
## Alfred University 0.003311258 0.9966887
## Allegheny College 0.003311258 0.9966887
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887
## Alma College 0.003311258 0.9966887
## Amherst College 0.003311258 0.9966887
We can turn these two columns into a single Yes/No column to match the original Private label.
tree.preds <- as.data.frame(tree.preds)
joiner <- function(x){
if (x>=0.5){
return('Yes')
}else {
return('No')
}
}
tree.preds$Private <- sapply(tree.preds$Yes,joiner)
head(tree.preds)
## No Yes Private
## Adrian College 0.003311258 0.9966887 Yes
## Alfred University 0.003311258 0.9966887 Yes
## Allegheny College 0.003311258 0.9966887 Yes
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887 Yes
## Alma College 0.003311258 0.9966887 Yes
## Amherst College 0.003311258 0.9966887 Yes
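The joiner() pattern works, but the same conversion can be done in one vectorized call with ifelse(). A sketch with a stand-in probability vector:

```r
# stand-in for tree.preds$Yes (hypothetical values for illustration)
yes.prob <- c(0.9966887, 0.40, 0.50, 0.003)
# ifelse() is vectorized, so no sapply() loop is needed
private.label <- ifelse(yes.prob >= 0.5, 'Yes', 'No')
private.label  # "Yes" "No"  "Yes" "No"
```

On the real data this is simply tree.preds$Private <- ifelse(tree.preds$Yes >= 0.5, 'Yes', 'No').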
Based on the above, we use table() to create a confusion matrix of the tree model. We can plot such tree model by prp() function.
# confusion matrix
table(tree.preds$Private,test$Private)
##
## No Yes
## No 57 9
## Yes 7 160
# plot by prp()
library(rpart.plot)
prp(tree)
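From a confusion matrix like the one above, overall accuracy is the sum of the diagonal divided by the total. A sketch rebuilt from the counts printed above (57, 9, 7, 160):

```r
# confusion matrix from the counts above: rows = predicted, columns = actual
cm <- matrix(c(57, 7, 9, 160), nrow = 2,
             dimnames = list(predicted = c('No', 'Yes'), actual = c('No', 'Yes')))
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)  # ~0.9313
```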
Let’s build out a random forest model. First, we call the randomForest package library, and use randomForest() to build out a model to predict Private class. Add importance=TRUE as a parameter in the model to find out what this does.
library(randomForest)
rf.model <- randomForest(Private~.,data=train,importance=T)
The model’s confusion matrix on its own training set is,
rf.model$confusion
## No Yes class.error
## No 125 23 0.15540541
## Yes 10 386 0.02525253
We grab the feature importance with model$importance; refer to the reading for more info on what Gini means.
rf.model$importance
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## Apps 0.030638008 0.0138490702 0.0183834770 8.191846
## Accept 0.031590637 0.0157698867 0.0199998841 13.280573
## Enroll 0.036691206 0.0284928559 0.0307724157 20.832978
## Top10perc 0.008783509 0.0041456953 0.0053787323 5.250942
## Top25perc 0.007412220 0.0044694518 0.0052145181 4.501693
## F.Undergrad 0.154011049 0.0748721489 0.0962450156 40.463681
## P.Undergrad 0.048475334 0.0073411365 0.0184278295 15.656301
## Outstate 0.141095168 0.0613496370 0.0830099419 42.602561
## Room.Board 0.021518748 0.0144373926 0.0163748212 11.415365
## Books 0.001852497 0.0002509989 0.0007121824 2.248328
## Personal 0.003032677 0.0016272012 0.0019899372 3.288314
## PhD 0.006891102 0.0048891976 0.0054043435 4.401988
## Terminal 0.005552642 0.0041588988 0.0045469137 4.373904
## S.F.Ratio 0.031130358 0.0096029611 0.0153991012 15.809862
## perc.alumni 0.021092075 0.0041202501 0.0087216547 5.065061
## Expend 0.025496668 0.0134694646 0.0166589911 10.504956
## Grad.Rate 0.017449980 0.0055589660 0.0088252942 7.442815
We use the random forest model to predict on the test set.
p <- predict(rf.model,test)
table(p,test$Private)
##
## p No Yes
## No 58 7
## Yes 6 162
It should have performed better than just a single tree, but how much better depends on whether you are measuring recall, precision, or accuracy as the most important measure of the model.
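To make that concrete, all three metrics can be read off the confusion matrix above (treating 'Yes' as the positive class; counts copied from the table):

```r
tn <- 58; fn <- 7    # predicted No:  actual No / actual Yes
fp <- 6;  tp <- 162  # predicted Yes: actual No / actual Yes
accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)  # of the schools predicted 'Yes', how many really are
recall    <- tp / (tp + fn)  # of the actual 'Yes' schools, how many we caught
round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```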
Reference: ISLR Chapter 11
Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier.
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
# iris dataset
library(ISLR)
print(head(iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:coefplot':
##
## extractPath
help('svm')
## starting httpd help server ...
## done
# svm model application
model <- svm(Species ~ .,data=iris)
summary(model)
##
## Call:
## svm(formula = Species ~ ., data = iris)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 51
##
## ( 8 22 21 )
##
##
## Number of Classes: 3
##
## Levels:
## setosa versicolor virginica
pred.values <- predict(model,iris[1:4])
table(pred.values,iris[,5])
##
## pred.values setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 2 48
Parameters:
- cost: allows the support vector machine to have what’s known as a soft margin; it is the parameter of the soft-margin cost function, which controls the penalty for misclassified points and therefore the influence of each individual support vector.
- gamma: the free parameter of the Gaussian radial basis function; a small gamma gives each support vector a far-reaching influence, while a large gamma keeps its influence local.
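The radial kernel itself is K(x, x') = exp(-gamma * ||x - x'||^2), so gamma controls how quickly similarity decays with distance. A small sketch:

```r
# Gaussian radial basis kernel
rbf <- function(x1, x2, gamma) exp(-gamma * sum((x1 - x2)^2))

x1 <- c(0, 0); x2 <- c(1, 1)  # squared Euclidean distance = 2
rbf(x1, x1, gamma = 0.25)  # identical points always give 1
rbf(x1, x2, gamma = 0.25)  # small gamma: similarity decays slowly
rbf(x1, x2, gamma = 2)     # large gamma: similarity decays quickly
```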
# ranges: a list of candidate cost and gamma values
tune.results <- tune(svm,train.x=iris[1:4],train.y=iris[,5],kernel='radial',ranges=list(cost=c(0.1,1,10),gamma=c(0.5,1,2)))
summary(tune.results)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.5
##
## - best performance: 0.04
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.1 0.5 0.06000000 0.05837300
## 2 1.0 0.5 0.04000000 0.04661373
## 3 10.0 0.5 0.05333333 0.06126244
## 4 0.1 1.0 0.05333333 0.05258738
## 5 1.0 1.0 0.04666667 0.03220306
## 6 10.0 1.0 0.06000000 0.04919099
## 7 0.1 2.0 0.08000000 0.05258738
## 8 1.0 2.0 0.05333333 0.04216370
## 9 10.0 2.0 0.06000000 0.05837300
tune.results <- tune(svm,train.x=iris[1:4],train.y=iris[,5],kernel='radial',ranges=list(cost=c(0.5,1,1.5),gamma=c(0.1,0.5,0.7)))
summary(tune.results)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.1
##
## - best performance: 0.02666667
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.5 0.1 0.04000000 0.06440612
## 2 1.0 0.1 0.02666667 0.03442652
## 3 1.5 0.1 0.03333333 0.03513642
## 4 0.5 0.5 0.04666667 0.04499657
## 5 1.0 0.5 0.04000000 0.04661373
## 6 1.5 0.5 0.04666667 0.04499657
## 7 0.5 0.7 0.05333333 0.04216370
## 8 1.0 0.7 0.04666667 0.04499657
## 9 1.5 0.7 0.04666667 0.04499657
Based on the above results, we fit the tuned model with cost=1.5 and gamma=0.1 (close to the best parameters found, cost=1 and gamma=0.1).
tuned.svm <- svm(Species~.,data=iris,kernel='radial',cost=1.5,gamma=0.1)
summary(tuned.svm)
##
## Call:
## svm(formula = Species ~ ., data = iris, kernel = "radial", cost = 1.5,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.5
## gamma: 0.1
##
## Number of Support Vectors: 50
##
## ( 4 23 23 )
##
##
## Number of Classes: 3
##
## Levels:
## setosa versicolor virginica
For this project, we will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). Hopefully, as an investor you would want to invest in people who showed a profile of having a high probability of paying you back. We will try to create a model that will help predict this.
Lending club had a very interesting year in 2016, so let’s check out some of their data and keep the context in mind. This data is from before they even went public.
We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here or just use the csv already provided. It’s recommended you use the csv provided as it has been cleaned of NA values.
Here is what the columns represent:
- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
We read the loan_data.csv file and save it as a data frame called loans.
loans <- read.csv('loan_data.csv')
We briefly check the summary and structure of loans.
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
summary(loans)
## credit.policy purpose int.rate
## Min. :0.000 all_other :2331 Min. :0.0600
## 1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
## Median :1.000 debt_consolidation:3957 Median :0.1221
## Mean :0.805 educational : 343 Mean :0.1226
## 3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
## Max. :1.000 major_purchase : 437 Max. :0.2164
## small_business : 619
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
## 1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
## Median :268.95 Median :10.929 Median :12.665 Median :707.0
## Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
## 3rd Qu.:432.76 3rd Qu.:11.291 3rd Qu.:17.950 3rd Qu.:737.0
## Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
##
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.6 1st Qu.: 0.000
## Median : 4140 Median : 8596 Median : 46.3 Median : 1.000
## Mean : 4561 Mean : 16914 Mean : 46.8 Mean : 1.577
## 3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 70.9 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :119.0 Max. :33.000
##
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.0000 Median :0.00000 Median :0.0000
## Mean : 0.1637 Mean :0.06212 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :13.0000 Max. :5.00000 Max. :1.0000
##
Next, we convert the following columns to categorical data using factor():
- credit.policy
- inq.last.6mths
- delinq.2yrs
- pub.rec
- not.fully.paid
loans$credit.policy <- factor(loans$credit.policy)
loans$inq.last.6mths <- factor(loans$inq.last.6mths)
loans$delinq.2yrs <- factor(loans$delinq.2yrs)
loans$pub.rec <- factor(loans$pub.rec)
loans$not.fully.paid <- factor(loans$not.fully.paid)
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : Factor w/ 28 levels "0","1","2","3",..: 1 1 2 2 1 1 1 1 2 2 ...
## $ delinq.2yrs : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 2 1 1 1 1 1 ...
## $ pub.rec : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 2 1 1 1 ...
## $ not.fully.paid : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
Let’s use ggplot2 to visualize the data. We first create a histogram of fico scores colored by not.fully.paid.
library(ggplot2)
pl <- ggplot(loans,aes(x=fico))
pl <- pl + geom_histogram(aes(fill=not.fully.paid),color='light grey',bins=40,alpha=0.5)
pl + scale_fill_manual(values=c('green','red'))+theme_bw()
Then, we create a barplot of purpose counts, colored by not.fully.paid. Use position=dodge in the geom_bar argument.
pl <- ggplot(loans,aes(x=factor(purpose)))
pl <- pl+geom_bar(aes(fill=not.fully.paid),position="dodge")
pl+theme_bw()+theme(axis.text.x=element_text(angle=90,hjust=1))
We also create a scatterplot of fico score versus int.rate. Does the trend make sense? Play around with the color scheme if you want.
ggplot(loans,aes(x=int.rate,y=fico))+geom_point()+theme_bw()
ggplot(loans,aes(x=int.rate,y=fico))+geom_point(aes(color=not.fully.paid),alpha=0.5)+theme_bw()
Now it’s time to build a model.
We first split the data into training and test sets using the caTools library.
library(caTools)
sample <- sample.split(loans$not.fully.paid,SplitRatio=0.8)
loans_train <- subset(loans,sample==T)
loans_test <- subset(loans,sample==F) # note '==': a single '=' would pass sample as an argument and return every row
We call the e1071 library as shown in the lecture. We now use the svm() function to train a model on the training dataset.
library(e1071)
# svm model application
loans_model <- svm(not.fully.paid~.,data=loans_train)
summary(loans_model)
##
## Call:
## svm(formula = not.fully.paid ~ ., data = loans_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.01724138
##
## Number of Support Vectors: 3233
##
## ( 2007 1226 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
We can use predict() to predict new values from the test set using the model.
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : Factor w/ 28 levels "0","1","2","3",..: 1 1 2 2 1 1 1 1 2 2 ...
## $ delinq.2yrs : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 2 1 1 1 1 1 ...
## $ pub.rec : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 2 1 1 1 ...
## $ not.fully.paid : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
# svm model prediction
pred.values <- predict(loans_model,loans_test[1:13])
table(pred.values,loans_test[,14])
##
## pred.values 0 1
## 0 8045 1533
## 1 0 0
table(pred.values,loans_test$not.fully.paid)
##
## pred.values 0 1
## 0 8045 1533
## 1 0 0
Tuning the model: we probably got some bad results, with the model classifying everything into one group. Let’s tune the model to fix this issue.
We use the tune() function to test different cost and gamma values. In the lecture, we showed how to do this using train.x and train.y, but it’s usually simpler to just pass a formula. Try checking out help(tune) for more details. This is the end of the project because tuning can take a long time (since it runs a bunch of different models).
tune.results <- tune(svm,train.x=not.fully.paid~.,data=loans_train,kernel='radial',ranges=list(cost=c(1,10),gamma=c(0.1,1)))
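The object returned by tune() holds more than a summary: it stores the winning parameter combination and a model already refit with it. A sketch, assuming `tune.results` from the call above:

```r
summary(tune.results)                  # cross-validated error for every cost/gamma pair
tune.results$best.parameters           # the combination with the lowest CV error
best.model <- tune.results$best.model  # an svm already refit with those values
```

Presumably the cost=10, gamma=0.1 values used below were read off best.parameters in this way.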
model <- svm(not.fully.paid~., data=loans_train,cost=10,gamma=0.1)
predicted.values <- predict(model,loans_test[1:13])
table(predicted.values,loans_test$not.fully.paid)
##
## predicted.values 0 1
## 0 8010 1133
## 1 35 400
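The confusion matrix above can be summarized with a few standard metrics. A sketch that hard-codes the counts printed above, treating class 1 (not fully paid) as the positive class:

```r
# Confusion matrix from the tuned model: rows = predicted, cols = actual
cm <- matrix(c(8010, 35, 1133, 400), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
accuracy  <- sum(diag(cm)) / sum(cm)        # (8010 + 400) / 9578
recall    <- cm["1", "1"] / sum(cm[, "1"])  # share of defaults we actually catch
precision <- cm["1", "1"] / sum(cm["1", ])  # share of flagged loans that truly default
round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
```

Note that accuracy is high (~0.88) while recall is low (~0.26): with imbalanced classes, accuracy alone is a misleading score.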
Reference: ISLR Chapter 10
K-Means Clustering is an unsupervised learning algorithm that will attempt to group similar clusters together in the data.So what does a typical clustering problem liik like? - cluster similar documents - cluster customers based on features - market segmentation - identify similar physical groups
The K Means Algorithm - choose a number of clusters "K" - randomly assign each point to a cluster - until clusters stop changing, repeat the following: (1) For each cluster, compute the cluster centroid by taking the mean vector of points in the cluster (2) Assign each data point to the cluster for which the centroid is the closest
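The steps above can be sketched directly in R. This is an illustration only (the name `simple_kmeans` is ours, and it does no empty-cluster handling); in practice you would call the built-in kmeans():

```r
# Naive k-means following the steps above: x is a numeric matrix, k the cluster count
simple_kmeans <- function(x, k, iters = 100) {
  assign <- sample(k, nrow(x), replace = TRUE)   # randomly assign each point to a cluster
  for (i in seq_len(iters)) {
    # (1) centroid = mean vector of the points currently in each cluster
    cents <- t(sapply(1:k, function(j) colMeans(x[assign == j, , drop = FALSE])))
    # (2) reassign every point to the cluster with the closest centroid
    d   <- as.matrix(dist(rbind(cents, x)))[-(1:k), 1:k]
    new <- apply(d, 1, which.min)
    if (all(new == assign)) break                # clusters stopped changing
    assign <- new
  }
  assign
}
```

For example, `simple_kmeans(as.matrix(iris[, 1:4]), 3)` returns a vector of cluster labels, one per row.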
There is no easy answer for choosing the best K value, but one way is the elbow method. - First of all, compute the sum of squared error (SSE) for some values of k. - The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid.
If we plot k against the SSE, we will see that the error decreases as k gets larger; this is because when the number of clusters increases, they should be smaller, so distortion is also smaller.
The idea of the elbow method is to choose the k at which the SSE stops decreasing abruptly. This produces an "elbow effect" in the plot of SSE against k.
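The elbow heuristic is easy to code with kmeans(), whose tot.withinss component is exactly the within-cluster SSE. A sketch on the iris measurements:

```r
# Within-cluster SSE for k = 1..10; look for the bend ("elbow") in the plot
sse <- sapply(1:10, function(k) kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, sse, type = "b",
     xlab = "Number of clusters k", ylab = "Within-cluster SSE")
```

For iris, the curve drops steeply up to k = 2 or 3 and flattens after that.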
library(ISLR)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
library(ggplot2)
pl <- ggplot(iris,aes(Petal.Length,Petal.Width))+geom_point(aes(color=Species),size=4)
pl
We now run K means clustering on the iris dataset.
set.seed(101)
irisCluster <- kmeans(iris[,1:4],3,nstart=20)
irisCluster
## K-means clustering with 3 clusters of sizes 62, 50, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.901613 2.748387 4.393548 1.433871
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3
## [106] 3 1 3 3 3 3 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3
## [141] 3 3 1 3 3 3 1 3 3 1
##
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
To evaluate the K means clustering, we can simply use the table() function to compare the clusters against the true labels. In addition, we can visualize the clusters with the cluster package.
table(irisCluster$cluster,iris$Species)
##
## setosa versicolor virginica
## 1 0 48 14
## 2 50 0 0
## 3 0 2 36
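One crude score for the table above is to match each cluster with its majority species and count how many points agree. A sketch, assuming `irisCluster` from the fit above:

```r
tab <- table(irisCluster$cluster, iris$Species)
# each cluster "votes" for its majority species; fraction of points that agree
sum(apply(tab, 1, max)) / sum(tab)   # (48 + 50 + 36) / 150 for the table above
```

Setosa is recovered perfectly, while versicolor and virginica overlap, which matches the scatter plot of petal measurements.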
# data visualization
library(cluster)
clusplot(iris[,1:4],irisCluster$cluster,color=TRUE,shade=TRUE,labels=0,lines=0)